Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

25 Mar 2024 | Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang
This paper presents Mipha, a comprehensive multimodal assistant built on Small Language Models (SLMs), designed to achieve performance competitive with state-of-the-art Multimodal Large Language Models (MLLMs) without requiring additional training data. The authors investigate the design aspects of Multimodal Small Language Models (MSLMs) and demonstrate that Mipha-3B outperforms leading open-source MLLMs such as LLaVA-1.5 and Qwen-VL on multiple benchmarks. The study explores three key design spaces of MSLMs: visual representation, language model, and optimization strategy. The findings reveal that increasing image resolution is not always beneficial, and that fine-tuning both the visual backbone and the language model is crucial for MSLMs. Additionally, instruction tuning is not essential for MSLMs, and parameter-efficient fine-tuning methods such as LoRA can be effective. The paper also highlights the importance of choosing an appropriate visual representation backbone and the benefits of using a pre-trained small language model. The results show that Mipha-3B achieves superior performance across a majority of benchmarks compared with 7B MLLMs, and in some cases even outperforms 13B MLLMs. The study provides insights and guidelines for developing strong MSLMs that can rival the capabilities of MLLMs. The code is available at https://github.com/zhuyiche/llava-phi.
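To make the optimization-strategy finding concrete, below is a minimal sketch (not the authors' code) of applying LoRA-style parameter-efficient fine-tuning to a small language model backbone of the Phi-2 class, using the Hugging Face transformers and peft libraries. The target_modules names are assumptions about the backbone's attention-projection layers and would need to match the actual architecture.

```python
# Minimal sketch: parameter-efficient (LoRA) fine-tuning of a small language
# model backbone, illustrating the kind of setup the paper's findings support.
# Assumptions: a Phi-2-class backbone and attention-projection module names.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a small language model backbone (~2.7B parameters).
slm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Configure low-rank adapters on the (assumed) attention projections.
lora_cfg = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed names
    task_type="CAUSAL_LM",
)

# Wrap the backbone; only the adapter weights are trainable.
slm = get_peft_model(slm, lora_cfg)
slm.print_trainable_parameters()  # reports the small trainable fraction
```

In a LLaVA-style pipeline, the wrapped language model would then be trained jointly with the visual backbone and projector on the multimodal data; the hyperparameters above are illustrative, not the paper's settings.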