2024 | Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
Ferret-v2 is an improved baseline for referring and grounding with large language models (LLMs). It introduces three key design improvements: (1) any-resolution grounding and referring, enabling the model to handle higher image resolutions and process fine image detail; (2) multi-granularity visual encoding, which integrates an additional DINOv2 encoder to capture diverse underlying contexts for both global and fine-grained visual information; and (3) a three-stage training paradigm comprising image-caption alignment, high-resolution dense alignment, and instruction tuning. With these enhancements, Ferret-v2 significantly outperforms Ferret and other state-of-the-art methods on tasks that demand detailed visual understanding, with evaluations spanning referring and grounding, visual question answering, and modern MLLM benchmarks. Its ability to process high-resolution images at fine granularity also yields strong visual grounding results, surpassing many existing models. Overall, Ferret-v2 represents a significant advancement in referring and grounding with LLMs.
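To make the multi-granularity encoding concrete, below is a minimal PyTorch sketch of one plausible reading of the idea: a low-resolution global view is encoded with a CLIP-style ViT while high-resolution local crops are encoded with DINOv2, and both are projected into the LLM's token space. The class name, constructor arguments, and projection layers are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    """Hypothetical sketch of multi-granularity visual encoding:
    fuse a low-res global view (CLIP-style encoder) with high-res
    local sub-image features (DINOv2-style encoder)."""

    def __init__(self, clip_encoder, dino_encoder, clip_dim, dino_dim, llm_dim):
        super().__init__()
        self.clip_encoder = clip_encoder   # global view -> (B, Ng, clip_dim) patch tokens
        self.dino_encoder = dino_encoder   # local crops -> (N, Nl, dino_dim) patch tokens
        self.global_proj = nn.Linear(clip_dim, llm_dim)
        self.local_proj = nn.Linear(dino_dim, llm_dim)

    def forward(self, global_image, local_crops):
        # global_image: (B, 3, H, W), resized to the global encoder's input size
        # local_crops:  (B, K, 3, h, w), a grid of high-resolution sub-images
        g = self.clip_encoder(global_image)               # (B, Ng, clip_dim)
        B, K = local_crops.shape[:2]
        l = self.dino_encoder(local_crops.flatten(0, 1))  # (B*K, Nl, dino_dim)
        l = l.reshape(B, -1, l.shape[-1])                 # (B, K*Nl, dino_dim)
        # Project both granularities into the LLM embedding space and
        # concatenate into a single visual token sequence.
        tokens = torch.cat([self.global_proj(g), self.local_proj(l)], dim=1)
        return tokens                                     # (B, Ng + K*Nl, llm_dim)
```

The fused token sequence would then be prepended or interleaved with text tokens for the LLM; how Ferret-v2 actually merges the two feature streams (e.g., channel-wise fusion versus concatenation) follows the paper, and this sketch only shows the simpler concatenation variant.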