2 Jun 2024 | Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
CoLLaVO is a large language and vision model (VLM) that enhances object-level image understanding through a novel visual prompt, the Crayon Prompt, which leverages panoptic color maps to provide semantic and numbering information for objects in an image. The model also incorporates Dual QLoRA, a learning strategy that preserves object-level understanding while training on visual instruction tuning datasets. Together, these components significantly improve zero-shot performance on vision-language tasks. The Crayon Prompt is injected into the model's image embedding features at every attention layer, allowing the model to retain the raw visual context of the image. Evaluated on a range of vision-language benchmarks, CoLLaVO outperforms several closed-source and open-source VLMs and achieves state-of-the-art results. The evaluation also shows that object-level image understanding is strongly correlated with zero-shot vision-language performance, underscoring the importance of this capability for VLMs.
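The Crayon Prompt described above can be pictured as a per-pixel embedding built from a panoptic map: each pixel gets a semantic embedding (its object class) plus a numbering embedding (its instance index), and the result is added residually to the image embeddings so the raw visual context is preserved. The sketch below illustrates this idea only; all names, shapes, and table sizes are illustrative assumptions, not the paper's actual API or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)


def crayon_prompt(semantic_map, numbering_map, sem_table, num_table):
    """Build a per-pixel Crayon Prompt embedding.

    Sums a semantic embedding (looked up by panoptic class id) and a
    numbering embedding (looked up by instance index). Hypothetical
    sketch; the real model's embedding scheme may differ.
    """
    return sem_table[semantic_map] + num_table[numbering_map]  # (H, W, D)


# Toy setup: a 4x4 "image", 3 panoptic classes, up to 2 instances
# per class, embedding dimension D = 8 (all sizes are arbitrary).
H, W, D = 4, 4, 8
sem_table = rng.normal(size=(3, D))   # one row per panoptic class
num_table = rng.normal(size=(2, D))   # one row per instance number
semantic_map = rng.integers(0, 3, size=(H, W))
numbering_map = rng.integers(0, 2, size=(H, W))

prompt = crayon_prompt(semantic_map, numbering_map, sem_table, num_table)

# Residual-style injection: the prompt is added to the image patch
# embeddings, so the original visual features remain recoverable.
image_emb = rng.normal(size=(H, W, D))
fused = image_emb + prompt
```

Because the injection is additive, subtracting the prompt from the fused features recovers the original image embeddings exactly, which is one way to read the summary's claim that the raw visual context is maintained.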