Efficient Multimodal Learning from Data-centric Perspective

22 Jul 2024 | Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao
This paper introduces *Bunny*, a family of lightweight Multimodal Large Language Models (MLLMs) designed to address the computational costs and performance limitations of large MLLMs. The authors demonstrate that training smaller MLLMs on high-quality, curated training data can achieve superior performance compared to their larger counterparts. *Bunny* offers flexible combinations of vision encoders and language backbones, including lightweight options such as Phi-1.5, Qwen1.5-1.8B, StableLM-2, MiniCPM-2B, Phi-2, Phi-3-Mini, and Llama-3-8B. The training data is constructed by selecting informative samples from a broader source, such as LAION-2B, through a three-step coreset selection process. Experiments show that *Bunny-4B/8B* outperform state-of-the-art large MLLMs on multiple benchmarks, including MME perception, MME cognition, MMBench, SEED-Bench-1, MMMU, VQA-v2, GQA, ScienceQA-IMG, and POPE. The paper also includes ablation studies on training techniques, such as LoRA, fine-tuning data, learning rates, and image resolution scaling, to optimize the model's performance. Overall, *Bunny* provides a clean and flexible open-source tool for further research and development in multimodal learning.
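The summary does not spell out the three selection steps, so the sketch below is only a rough illustration of how coreset-style data curation of this kind can work: it ranks image-text pairs by embedding alignment and spreads a fixed selection budget across clusters for diversity. The function name, the clustering heuristic, and the scoring rule are assumptions for illustration and should not be read as the paper's actual procedure.

```python
import numpy as np

def select_coreset(img_emb, txt_emb, k, num_clusters=100, seed=0):
    """Pick up to k informative image-text pairs from precomputed (N, d) embeddings.

    Illustrative sketch only (not the Bunny pipeline): score each pair by
    image-text cosine similarity, then take the best-scored pairs from each
    cluster so the selection stays diverse.
    """
    rng = np.random.default_rng(seed)
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = (img * txt).sum(axis=1)  # alignment score per pair

    # Crude k-means-style clustering on image embeddings (a few Lloyd iterations).
    centers = img[rng.choice(len(img), num_clusters, replace=False)]
    for _ in range(10):
        assign = np.argmax(img @ centers.T, axis=1)
        for c in range(num_clusters):
            members = img[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
                centers[c] /= np.linalg.norm(centers[c])

    # Allocate the budget proportionally to cluster size and keep the
    # best-aligned pairs within each cluster.
    selected = []
    for c in range(num_clusters):
        idx = np.where(assign == c)[0]
        if len(idx) == 0:
            continue
        budget = max(1, int(round(k * len(idx) / len(img))))
        selected.extend(idx[np.argsort(-sim[idx])[:budget]])
    return np.array(selected[:k])
```

In practice the embeddings would come from a pretrained vision-language encoder (e.g., CLIP-style image and caption features), and the scoring and clustering criteria are exactly the knobs a data-centric pipeline like Bunny's would tune.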