Efficient Multimodal Learning from Data-centric Perspective

22 Jul 2024 | Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao
This paper introduces Bunny, a family of lightweight multimodal large language models (MLLMs) that achieve high performance by training on high-quality data. Each model pairs a flexible, lightweight vision backbone and language backbone with a cross-modality projector, enabling efficient multimodal learning. The authors show that by applying data optimization techniques such as dataset condensation to curate a compact, diverse training set, smaller MLLMs can outperform state-of-the-art large MLLMs on multiple benchmarks: Bunny-4B and Bunny-8B surpass existing models on visual question answering, reasoning, and understanding tasks.

Training follows a two-stage process of pre-training and fine-tuning on the carefully curated data. The paper also reports ablation studies on the choice of training data, learning rates, and model configurations. Across benchmarks including MME, MMBench, SEED-Bench-1, MMMU, VQA-v2, GQA, ScienceQA-IMG, and POPE, the Bunny models achieve strong results; they also handle high-resolution images and perform well on Chinese instruction-following tasks.

The paper concludes that Bunny offers a flexible and efficient solution for multimodal learning and a useful foundation for further research and development in the field. The code, models, and data are available in the official GitHub repository.
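As a rough illustration of the data-centric idea summarized above, the snippet below sketches one common way to distill a compact yet diverse training subset from a large image-text pool: embed every sample, cluster the embeddings, and keep only the samples closest to each cluster center. This is a minimal sketch in the spirit of dataset condensation, not Bunny's actual curation pipeline; the function name `select_coreset`, the cluster count, and the per-cluster budget are illustrative assumptions.

```python
# Hypothetical sketch of embedding-based data selection, not Bunny's exact pipeline.
import numpy as np
from sklearn.cluster import KMeans

def select_coreset(embeddings: np.ndarray,
                   n_clusters: int = 1000,
                   keep_per_cluster: int = 100) -> np.ndarray:
    """Pick a diverse subset of a large pool by keeping, for each cluster,
    the samples closest to the cluster centroid. `embeddings` is an
    (N, D) array of image-text embeddings (e.g. from a CLIP-style encoder)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
    selected = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        # Rank cluster members by distance to their centroid and keep the closest,
        # so each retained sample is representative of its region of the pool.
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:keep_per_cluster]].tolist())
    return np.asarray(selected)
```

The model layout described above (lightweight vision and language backbones joined by a cross-modality projector) follows the now-standard LLaVA-style wiring. The sketch below shows that wiring under the assumption of a two-layer MLP projector; the class and argument names are placeholders for illustration, not Bunny's released code.

```python
# Minimal PyTorch sketch of a lightweight MLLM with a cross-modality projector.
# Class names and the MLP projector design are assumptions for illustration.
import torch
import torch.nn as nn

class CrossModalityProjector(nn.Module):
    """Maps vision-backbone patch features into the language model's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)

class LightweightMLLM(nn.Module):
    """Vision backbone + projector + lightweight language backbone."""
    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a compact ViT-style image encoder
        self.projector = CrossModalityProjector(vision_dim, llm_dim)
        self.language_model = language_model   # e.g. a small decoder-only LLM

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_tokens = self.vision_encoder(images)    # (B, N_patches, vision_dim)
        visual_embeds = self.projector(patch_tokens)  # (B, N_patches, llm_dim)
        # Prepend projected visual tokens to the text token embeddings and let
        # the language model attend over the combined sequence.
        fused = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(fused)
```

In a two-stage recipe like the one summarized above, a typical choice (assumed here, not spelled out in this summary) is to train only the projector during pre-training for cross-modal alignment and to unfreeze or LoRA-tune the backbones during instruction fine-tuning.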