Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models


27 Mar 2024 | Yanwei Li1*, Yuechen Zhang1*, Chengyao Wang1*, Zhisheng Zhong1, Yixin Chen1, Ruihang Chu1, Shaoteng Liu1, Jiaya Jia1,2
The article introduces Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Despite rapid progress, open VLMs still lag behind proprietary models such as GPT-4 and Gemini. To narrow this gap, Mini-Gemini targets three areas: high-resolution visual tokens, high-quality data, and VLM-guided generation.

For visual tokens, the framework pairs a low-resolution vision encoder with an additional high-resolution encoder and applies patch info mining to refine the low-resolution tokens with fine-grained visual cues, without increasing the visual token count. A high-quality dataset is constructed to improve image comprehension and reasoning-based generation, and instruction tuning on this data further boosts performance. An any-to-any workflow enables the model to generate images and text simultaneously.

Mini-Gemini supports large language models (LLMs) ranging from 2B to 34B parameters and achieves leading zero-shot performance across benchmarks, in some cases surpassing private models. Experiments show it outperforms existing methods on a variety of benchmarks while remaining efficient, scalable, and adaptable, making it a valuable tool for advancing multi-modal capabilities in vision language models.
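To make the patch-info-mining idea concrete, below is a minimal PyTorch sketch of the cross-attention it describes: each low-resolution visual token acts as a query over the high-resolution features of its own patch region, so the output keeps the original token count. The class name, the residual MLP, and the assumption that high-resolution features are already grouped per low-resolution patch are illustrative choices for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Sketch: refine N low-res tokens with high-res cues via per-patch
    cross-attention. Output token count stays N (no extra visual tokens)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        # low_res:  (B, N, C)    -- one query token per low-res patch
        # high_res: (B, N, M, C) -- M high-res features per low-res patch region
        q = self.to_q(low_res).unsqueeze(2)                       # (B, N, 1, C)
        k = self.to_k(high_res)                                   # (B, N, M, C)
        v = self.to_v(high_res)                                   # (B, N, M, C)
        scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5   # (B, N, 1, M)
        mined = (scores.softmax(dim=-1) @ v).squeeze(2)           # (B, N, C)
        return low_res + self.mlp(mined)                          # still N tokens


# Usage: 576 low-res tokens, each mining from a 2x2 high-res region.
pim = PatchInfoMining(dim=1024)
out = pim(torch.randn(1, 576, 1024), torch.randn(1, 576, 4, 1024))
print(out.shape)  # torch.Size([1, 576, 1024])
```

Because attention is restricted to each token's own region, the cost scales with the number of low-resolution tokens rather than the full high-resolution grid, which is what lets the framework raise input resolution without inflating the LLM's visual context.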