Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

25 Mar 2024 | Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang
This paper presents Mipha, a comprehensive multimodal assistant built on Small Language Models (SLMs), designed to achieve performance competitive with state-of-the-art Multimodal Large Language Models (MLLMs) without requiring additional training data. The authors investigate the design aspects of Multimodal Small Language Models (MSLMs) and demonstrate that Mipha-3B outperforms leading open-source MLLMs such as LLaVA-1.5 and Qwen-VL on multiple benchmarks. The study explores three key design spaces of MSLMs: visual representation, language model, and optimization strategy. The findings reveal that increasing image resolution is not always beneficial, and that fine-tuning both the visual backbone and the language model is crucial for MSLMs. Additionally, instruction tuning is not essential for MSLMs, and parameter-efficient fine-tuning methods such as LoRA can be effective. The paper also highlights the importance of choosing an appropriate visual representation backbone and the benefits of using a pre-trained small language model. The results show that Mipha-3B achieves superior performance across a majority of benchmarks compared with 7B MLLMs, and in some cases even outperforms 13B MLLMs. The study provides insights and guidelines for developing strong MSLMs that can rival the capabilities of MLLMs. The code is available at https://github.com/zhuyiche/llava-phi.
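To make the optimization-strategy finding concrete, below is a minimal sketch (not the authors' code) of applying LoRA-style parameter-efficient fine-tuning to a small language model backbone of the Phi-2 class, using the Hugging Face transformers and peft libraries. The target_modules names are assumptions about the backbone's attention-projection layers and would need to match the actual architecture.

```python
# Minimal sketch: parameter-efficient (LoRA) fine-tuning of a small language
# model backbone, illustrating the kind of setup the paper's findings support.
# Assumptions: a Phi-2-class backbone and attention-projection module names.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a small language model backbone (~2.7B parameters).
slm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Configure low-rank adapters on the (assumed) attention projections.
lora_cfg = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed names
    task_type="CAUSAL_LM",
)

# Wrap the backbone; only the adapter weights are trainable.
slm = get_peft_model(slm, lora_cfg)
slm.print_trainable_parameters()  # reports the small trainable fraction
```

In a LLaVA-style pipeline, the wrapped language model would then be trained jointly with the visual backbone and projector on the multimodal data; the hyperparameters above are illustrative, not the paper's settings.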