MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training


18 Apr 2024 | Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruvi Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu He, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang
This paper presents MM1, a family of multimodal large language models (MLLMs) that achieve state-of-the-art (SOTA) pre-training metrics and competitive performance after supervised fine-tuning on established multimodal benchmarks. The authors investigate how architecture components and data choices affect the quality of MLLMs. Through comprehensive ablations of the image encoder, the vision-language connector, and the pre-training data, they identify several crucial design lessons. For example, they demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving SOTA few-shot results across multiple benchmarks. They also show that the image encoder, image resolution, and image token count have a substantial impact, while the design of the vision-language connector is of comparatively negligible importance.

By scaling up this recipe, they build MM1, a family of multimodal models with dense variants up to 30B parameters and mixture-of-experts (MoE) variants up to 64B parameters, that is SOTA in pre-training metrics and remains competitive after supervised fine-tuning on a range of established multimodal benchmarks. MM1 exhibits appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
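To make the architecture discussion concrete, below is a minimal sketch of a vision-language connector in PyTorch: patch features from the image encoder are pooled down to a fixed image-token budget and projected into the LLM embedding space. The class name, dimensions, and pooling choice are illustrative assumptions, not MM1's actual implementation; the paper's finding is that the image-token count and image resolution matter far more than the specific connector design.

```python
# Minimal sketch of a vision-language connector: pool ViT patch features to a
# fixed number of image tokens, then project them into the LLM embedding space.
# Names and dimensions are illustrative, not MM1's actual implementation.
import torch
import torch.nn as nn


class AvgPoolConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, num_image_tokens: int = 64):
        super().__init__()
        # Adaptive pooling reduces a variable number of patch features to a
        # fixed image-token budget; the ablations suggest this budget matters
        # more than the connector architecture itself.
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the image encoder
        x = patch_features.transpose(1, 2)    # (batch, vision_dim, num_patches)
        x = self.pool(x).transpose(1, 2)      # (batch, num_image_tokens, vision_dim)
        return self.proj(x)                   # (batch, num_image_tokens, llm_dim)


# Example: 576 patch features pooled down to 64 image tokens for the LLM.
if __name__ == "__main__":
    connector = AvgPoolConnector()
    patches = torch.randn(2, 576, 1024)
    image_tokens = connector(patches)
    print(image_tokens.shape)  # torch.Size([2, 64, 4096])
```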
The authors also explore the impact of different pre-training data choices, finding that interleaved image-text data is instrumental for few-shot and text-only performance, while captioning data lifts zero-shot performance; synthetic data further helps few-shot learning. On the architecture side, MoE variants achieve consistently better performance than their dense counterparts on almost every benchmark. Finally, after supervised fine-tuning, MM1 outperforms comparable models on several benchmarks, including VQAv2, TextVQA, and SEED. The authors conclude that the lessons learned from these ablation studies remain valuable for building strong models beyond any single model architecture or data strategy.
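As a rough illustration of how such a data mixture can be realized during pre-training, the sketch below picks each batch's source according to fixed sampling weights. The loader stubs and the specific weights are placeholder assumptions, not the paper's exact recipe; only the qualitative roles noted in the comments come from the summary above.

```python
# Illustrative sketch of sampling pre-training batches from a weighted mix of
# caption, interleaved image-text, and text-only data. The weights below are
# placeholders, not the paper's exact ratios.
import random
from typing import Callable, Dict

# Each loader would yield a batch from the corresponding corpus; these are
# stand-ins so the sketch stays self-contained.
def caption_batch() -> str:
    return "batch of image-caption pairs"

def interleaved_batch() -> str:
    return "batch of interleaved image-text documents"

def text_only_batch() -> str:
    return "batch of text-only documents"

MIXTURE: Dict[str, float] = {
    "caption": 0.45,      # lifts zero-shot performance
    "interleaved": 0.45,  # instrumental for few-shot and text-only performance
    "text_only": 0.10,    # preserves language-only capabilities
}

LOADERS: Dict[str, Callable[[], str]] = {
    "caption": caption_batch,
    "interleaved": interleaved_batch,
    "text_only": text_only_batch,
}

def sample_batch(rng: random.Random) -> str:
    """Pick a data source according to the mixture weights, then draw a batch."""
    sources, weights = zip(*MIXTURE.items())
    source = rng.choices(sources, weights=weights, k=1)[0]
    return LOADERS[source]()

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_batch(rng))
```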