Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

27 Mar 2024 | Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
Mini-Gemini is a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Despite rapid progress, VLMs still show a performance gap relative to advanced models such as GPT-4 and Gemini. To narrow this gap, Mini-Gemini mines the potential of VLMs from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation.

To enhance visual tokens, it employs an additional visual encoder for high-resolution refinement without increasing the visual token count. It also constructs a high-quality dataset that promotes precise image comprehension and reasoning-based generation. Mini-Gemini supports a series of dense and MoE large language models from 2B to 34B parameters, enables image understanding, reasoning, and generation simultaneously in an any-to-any workflow, and achieves leading performance on several zero-shot benchmarks, even surpassing developed private models.

At the core of the architecture, dual vision encoders process the image input alongside the text: a low-resolution encoder produces the visual tokens passed to the language model, while a high-resolution encoder provides candidate keys and values for reference. Patch info mining then cross-attends between the two streams, extracting high-resolution detail without increasing the visual token count, which balances detail richness against computational feasibility.
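Patch info mining can be read as local cross-attention: each low-resolution visual token acts as a query over the high-resolution features covering the same image region, so the mined detail refines the token without adding new ones. The sketch below is a minimal PyTorch illustration under that reading, not the authors' implementation; the module name, projection layout, and the grouping of high-resolution sub-patches per low-resolution token are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchInfoMining(nn.Module):
    """Minimal sketch of patch info mining (hypothetical layer names).

    Each low-resolution visual token queries the high-resolution
    sub-patches covering the same image region, mining detail
    without increasing the visual token count.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        # low_res:  (B, N, C)     -- N low-resolution visual tokens (queries)
        # high_res: (B, N, M2, C) -- M2 high-res sub-patches per token (keys/values)
        q = self.q_proj(low_res).unsqueeze(2)                   # (B, N, 1, C)
        k = self.k_proj(high_res)                               # (B, N, M2, C)
        v = self.v_proj(high_res)                               # (B, N, M2, C)
        attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5   # (B, N, 1, M2)
        attn = attn.softmax(dim=-1)
        mined = (attn @ v).squeeze(2)                           # (B, N, C)
        # Residual keeps the original token; mined detail refines it.
        return low_res + self.out_proj(mined)

# Toy usage: 576 low-res tokens, each paired with 4 high-res sub-patches.
x_lr = torch.randn(1, 576, 1024)
x_hr = torch.randn(1, 576, 4, 1024)
tokens = PatchInfoMining(1024)(x_lr, x_hr)
print(tokens.shape)  # torch.Size([1, 576, 1024]) -- token count unchanged
```

Note how the output keeps exactly N tokens: the high-resolution stream only contributes keys and values, which is what keeps the language model's sequence length, and hence its compute, unchanged.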
Extensive experiments on a range of benchmarks validate this design, demonstrating Mini-Gemini's effectiveness in both image understanding and generation. Beyond comprehension, the framework supports reasoning-based generation: the VLM produces high-quality text prompts that guide a downstream image generator. Qualitative results further show its ability to handle complex visual tasks and generate high-quality images, and the framework is expected to serve as a strong benchmark for image understanding and VLM-guided generation.
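To make the VLM-guided generation flow concrete, here is a hedged sketch of how such prompt routing might look. The `<gen>` tag format, `extract_generation_prompt`, and the `text_to_image` stub are hypothetical illustrations, not Mini-Gemini's actual interface; the paper pairs the VLM with a latent diffusion model as the image backend.

```python
import re
from typing import Optional

# Hypothetical tag format: assume the VLM is trained to wrap a
# text-to-image prompt in <gen>...</gen> when an image is requested.
GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)

def extract_generation_prompt(vlm_output: str) -> Optional[str]:
    """Return the embedded image prompt, or None for a text-only answer."""
    match = GEN_TAG.search(vlm_output)
    return match.group(1).strip() if match else None

def text_to_image(prompt: str) -> str:
    # Hypothetical stand-in for a real text-to-image backend
    # (e.g. a latent diffusion pipeline consuming the VLM's prompt).
    return f"<image rendered from: {prompt!r}>"

def route(vlm_output: str) -> str:
    """Dispatch a VLM response to text output or image generation."""
    prompt = extract_generation_prompt(vlm_output)
    if prompt is None:
        return vlm_output          # ordinary understanding/reasoning answer
    return text_to_image(prompt)   # reasoning-based generation path

# Toy usage: the VLM answers a request by writing a generation prompt.
print(route("Sure! <gen>a watercolor fox curled up in autumn leaves</gen>"))
```

The key design point this illustrates is that the VLM never renders pixels itself; it reasons over the conversation and emits a refined text prompt, and any text-to-image model can consume that prompt, which is what makes the workflow any-to-any.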