Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

2024 | Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
Ferret-v2 is an advanced model designed to enhance the capabilities of large language models (LLMs) in referring and grounding tasks. It addresses the limitations of its predecessor, Ferret, by introducing three key improvements: (1) Any Resolution Grounding and Referring, which allows the model to handle higher image resolutions and process images with greater detail; (2) Multi-Granularity Visual Encoding, which integrates a DINOv2 encoder to capture both global and fine-grained visual information; and (3) A Three-Stage Training Paradigm, which includes an additional stage for high-resolution dense alignment before final instruction tuning.
The paper demonstrates that Ferret-v2 outperforms Ferret and other state-of-the-art methods in various tasks, including referring and grounding, visual question answering, and modern MLLM benchmarks. The model's superior performance is attributed to its ability to handle higher-resolution images, its multi-granularity visual encoding, and its refined training process. The authors also provide a detailed analysis of the model's architecture, training methods, and ablation studies to support their claims.
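The "any resolution" idea can be sketched as follows: rather than downscaling a high-resolution image to the vision encoder's fixed input size, the image is cut into a grid of base-resolution sub-images that are encoded separately (alongside a low-resolution global view). The grid selection and padding below are illustrative assumptions for a minimal sketch, not the authors' exact implementation; the base resolution of 336 matches a typical CLIP ViT-L/14 input.

```python
# Minimal sketch of "any resolution" tiling: pad a high-resolution image to a
# multiple of the encoder's base resolution, then slice it into a grid of
# base x base sub-images. Each tile would then be encoded independently.
import numpy as np

BASE = 336  # assumed base input resolution (typical for CLIP ViT-L/14)

def split_any_resolution(image: np.ndarray, base: int = BASE):
    """Pad `image` (H, W, C) up to a multiple of `base`, then cut it into
    a grid of base x base tiles. Returns (tiles, (rows, cols))."""
    h, w = image.shape[:2]
    rows = -(-h // base)  # ceiling division
    cols = -(-w // base)
    padded = np.zeros((rows * base, cols * base) + image.shape[2:], dtype=image.dtype)
    padded[:h, :w] = image  # zero-pad on the bottom/right
    tiles = [
        padded[r * base:(r + 1) * base, c * base:(c + 1) * base]
        for r in range(rows)
        for c in range(cols)
    ]
    return tiles, (rows, cols)

# Example: a 700x500 RGB image yields a 3x2 grid of six 336x336 tiles.
img = np.zeros((700, 500, 3), dtype=np.uint8)
tiles, grid = split_any_resolution(img)
print(grid, len(tiles), tiles[0].shape)  # (3, 2) 6 (336, 336, 3)
```

In the full model, the tile features would be combined with a downsized global image (and, per the paper, fused with DINOv2 features for fine-grained detail) before being passed to the LLM.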