14 Mar 2024 | Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Griffon v2 is a high-resolution multimodal model designed to enhance object perception and visual-language co-referring capabilities. The model addresses the limited image resolution of large vision-language models (LVLMs), which hinders their performance in complex and dense scenarios. To overcome this, Griffon v2 employs a high-resolution visual encoder and a lightweight down-sampling projector to efficiently scale up image resolution without dividing the image into smaller patches. This approach preserves fine details and context, improving multimodal perception, especially for small objects.
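To make the down-sampling projector idea concrete, here is a minimal sketch, not the authors' implementation: high-resolution visual features are compressed with a strided convolution and then projected into the language model's embedding space, keeping the visual token count manageable without tiling the image into sub-patches. All dimensions, names, and the choice of a strided convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DownsamplingProjector(nn.Module):
    """Hypothetical projector: compress a high-resolution feature grid, then map to LLM space."""
    def __init__(self, vis_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # Strided conv halves the spatial grid (e.g., 64x64 -> 32x32 token positions).
        self.pool = nn.Conv2d(vis_dim, vis_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feats):  # feats: (B, vis_dim, H, W) from the visual encoder
        x = self.pool(feats)                 # (B, vis_dim, H/stride, W/stride)
        x = x.flatten(2).transpose(1, 2)     # (B, N_tokens, vis_dim)
        return self.proj(x)                  # (B, N_tokens, llm_dim)

# Example: a high-resolution input encoded into a 64x64 feature grid
feats = torch.randn(1, 1024, 64, 64)
tokens = DownsamplingProjector()(feats)
print(tokens.shape)  # torch.Size([1, 1024, 4096]) -- 32*32 = 1024 visual tokens
```

The point of the sketch is the trade-off: down-sampling before projection lets the encoder run at high resolution while the language model still sees a sequence of moderate length.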
The model also introduces a visual-language co-referring mechanism, enabling users to refer to objects with visual prompts (e.g., screenshots, cross-image modes) as well as textual descriptions. This feature enhances user interaction by supporting flexible target images, free-form text, and even coordinates. Experiments demonstrate that Griffon v2 accurately localizes objects of interest and generates descriptions with co-referring across various scenarios, achieving state-of-the-art performance on tasks such as Referring Expression Comprehension (REC), phrase grounding, and Referring Expression Generation (REG). The model also outperforms expert models in object detection and counting, marking the first time an LVLM has achieved expert-level performance in these areas.
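As a rough illustration of co-referring, the sketch below shows one plausible way a visual prompt (e.g., a pooled feature of a user-provided screenshot crop) could be spliced into a text instruction at a placeholder position, so the same target can be named either in words or by image. The encoders, dimensions, and the <region> placeholder are assumptions for illustration, not the paper's exact interface.

```python
import torch
import torch.nn as nn

class CoReferringPromptBuilder(nn.Module):
    """Hypothetical builder that merges a visual-prompt token into an embedded text prompt."""
    def __init__(self, llm_dim=4096, region_dim=1024):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, llm_dim)  # map region feature to LLM space

    def forward(self, text_embeds, region_feat=None, placeholder_idx=None):
        # text_embeds: (T, llm_dim), e.g. the embedded instruction
        #   "Locate all objects that look like <region> in the image."
        if region_feat is None:
            return text_embeds  # purely textual referring
        region_tok = self.region_proj(region_feat).unsqueeze(0)  # (1, llm_dim)
        # Replace the <region> placeholder embedding with the visual-prompt token.
        return torch.cat(
            [text_embeds[:placeholder_idx], region_tok, text_embeds[placeholder_idx + 1:]],
            dim=0,
        )

builder = CoReferringPromptBuilder()
text = torch.randn(12, 4096)      # 12 embedded instruction tokens
region = torch.randn(1024)        # pooled feature of the user's screenshot crop
seq = builder(text, region, placeholder_idx=5)
print(seq.shape)                  # torch.Size([12, 4096])
```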
The paper includes a detailed overview of the model's architecture, training pipeline, and experimental results, showcasing its effectiveness in handling high-resolution inputs and providing accurate visual-language co-referring. The authors also conduct ablation studies to validate the contributions of different components and present qualitative analysis to illustrate the model's performance across various tasks.