CoLLaVO: Crayon Large Language and Vision mOdel

2 Jun 2024 | Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
The paper introduces CoLLaVO, a new large language and vision model that enhances object-level image understanding through the use of Crayon Prompt and Dual QLoRA. The authors find that current Vision Language Models (VLMs) lack robust object-level image understanding, which significantly impacts their zero-shot performance on Vision Language (VL) tasks. To address this, CoLLaVO incorporates a Crayon Prompt, which uses panoptic color maps to guide the model's attention to specific objects, and Dual QLoRA, a learning strategy that preserves object-level understanding while improving complex VL performance. The results show that CoLLaVO outperforms existing VLMs in both object-level image understanding and zero-shot VL tasks, demonstrating the importance of foundational image understanding in VL performance. The paper also discusses the limitations and future directions for improving visual prompts and object-level image understanding.