Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

14 Mar 2024 | Yufei Zhan¹,², Yousong Zhu¹, Hongyin Zhao¹, Fan Yang¹,²,³, Ming Tang¹,², and Jinqiao Wang¹,²,³,⁴
Griffon v2 is a high-resolution multimodal model that strengthens visual and language referring. It targets the image-resolution limits of large vision-language models (LVLMs), which degrade performance in complex and dense scenes, and introduces a unified high-resolution generalist model that supports flexible object referring through visual and textual prompts.

To scale image resolution efficiently, Griffon v2 employs a lightweight downsampling projector that compresses high-resolution visual features while preserving fine detail, improving multimodal perception, particularly for small objects. The high-resolution structure extracts visual features directly and projects them into the text embedding space, enabling precise object localization and description, and it supports inputs up to 1K resolution without image division, broadening its applicability across diverse scenarios.
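The summary gives only a high-level description of the downsampling projector, so the following PyTorch snippet is a minimal, hypothetical sketch of the idea: a strided convolution merges neighboring visual tokens before an MLP projects them into the LLM's embedding space. The layer sizes, stride, and token-grid numbers are illustrative assumptions, not Griffon v2's released configuration.

```python
# Minimal sketch of a lightweight downsampling projector (PyTorch).
# All sizes and the strided-convolution choice are illustrative assumptions.
import torch
import torch.nn as nn


class DownsamplingProjector(nn.Module):
    """Compress a dense grid of visual tokens, then project to the LLM width."""

    def __init__(self, vis_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # Strided conv merges neighboring patch tokens, shrinking the token count
        # by roughly stride**2 while keeping the spatial layout.
        self.down = nn.Conv2d(vis_dim, vis_dim, kernel_size=stride, stride=stride)
        # Two-layer MLP maps compressed visual features into text-embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                   # feats: (B, H, W, vis_dim) patch grid
        x = feats.permute(0, 3, 1, 2)           # -> (B, vis_dim, H, W)
        x = self.down(x)                        # -> (B, vis_dim, H/s, W/s)
        x = x.flatten(2).transpose(1, 2)        # -> (B, tokens, vis_dim)
        return self.proj(x)                     # -> (B, tokens, llm_dim)


# Illustrative only: a ~1K input with 14-px patches yields a 73x73 token grid,
# which the stride-2 projector compresses to 36x36 = 1296 tokens.
tokens = DownsamplingProjector()(torch.randn(1, 73, 73, 1024))
print(tokens.shape)  # torch.Size([1, 1296, 4096])
```

The point of the sketch is the compression step: reducing the token count before the language model keeps high-resolution inputs tractable without dividing the image into tiles.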
For interaction, Griffon v2 supports visual-language co-referring through a plug-and-play visual tokenizer, letting users refer to targets with cropped target images, free-form text, or coordinates. This mitigates the limitations of relying on a single visual or language prompt.

Trained on 12M localization samples and 900K instruction samples, Griffon v2 localizes objects accurately and achieves state-of-the-art results on Referring Expression Comprehension (REC), phrase grounding, and Referring Expression Generation (REG), while surpassing expert models in object detection and object counting. Its design is computationally efficient, making it suitable for large-scale pretraining. Overall, Griffon v2 advances multimodal perception with high-resolution scaling and visual-language co-referring, delivering improved performance in complex and dense scenarios.
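To make the co-referring interface described above concrete, here is a similarly hedged sketch of how a visual prompt could be encoded into a referring token and spliced into the text prompt. The `VisualPromptTokenizer`, the `<region>` placeholder, and all dimensions are hypothetical names chosen for illustration, not the model's actual API.

```python
# Minimal sketch of visual-language co-referring (PyTorch).
# The tokenizer design, the <region> placeholder, and all dimensions are
# illustrative assumptions rather than Griffon v2's exact interface.
import torch
import torch.nn as nn


class VisualPromptTokenizer(nn.Module):
    """Turn a cropped target image's features into one referring token."""

    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # collapse the crop's patch grid
        self.proj = nn.Linear(vis_dim, llm_dim)   # align with text embeddings

    def forward(self, crop_feats):                # crop_feats: (B, vis_dim, h, w)
        pooled = self.pool(crop_feats).flatten(1) # (B, vis_dim)
        return self.proj(pooled).unsqueeze(1)     # (B, 1, llm_dim)


def splice_visual_prompt(text_embeds, region_token, placeholder_idx):
    """Insert the visual referring token at the <region> placeholder position."""
    return torch.cat(
        [text_embeds[:, :placeholder_idx],
         region_token,
         text_embeds[:, placeholder_idx + 1:]],
        dim=1,
    )


# Toy usage: a prompt like "Locate <region> in the image." with <region> at index 1.
text_embeds = torch.randn(1, 8, 4096)             # embedded text prompt
region_token = VisualPromptTokenizer()(torch.randn(1, 1024, 16, 16))
fused = splice_visual_prompt(text_embeds, region_token, placeholder_idx=1)
print(fused.shape)  # torch.Size([1, 8, 4096])
```

Because the visual prompt is reduced to ordinary embedding-space tokens, the same language backbone can accept image regions, free-form text, or coordinates interchangeably, which is the practical benefit the paper attributes to co-referring.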