18 Apr 2024 | Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dutter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu He, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
This paper discusses the development of Multimodal Large Language Models (MLLMs), focusing on the importance of various architectural components and data choices. The authors conduct comprehensive ablations to identify crucial design lessons, such as the impact of image resolution, the image encoder pre-training loss, and the pre-training data mixture. They demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results. The image encoder and image resolution have a significant impact, while the design of the vision-language connector is comparatively unimportant. Scaling up this recipe yields MM1, a family of MLLMs with dense variants up to 30B parameters and mixture-of-experts (MoE) variants up to 64B parameters, which achieve SOTA pre-training metrics and competitive performance on established benchmarks after supervised fine-tuning. The MM1 models exhibit appealing properties such as enhanced in-context learning and multi-image reasoning, enabling strong few-shot learning capabilities. The paper provides detailed insights into the MLLM building process, aiming to help the community develop robust models beyond any specific architecture or data strategy.
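To make the architecture discussion concrete, below is a minimal sketch of the generic MLLM layout the summary describes: an image encoder produces patch features, a vision-language connector projects them into the language model's embedding space, and the combined visual and text tokens are fed to a decoder. All module names, dimensions, and the average-pooling connector here are illustrative assumptions, not MM1's actual implementation.

```python
# Toy MLLM layout: image encoder features -> connector -> language model.
# Shapes and modules are hypothetical stand-ins, not the MM1 architecture.
import torch
import torch.nn as nn

class ToyConnector(nn.Module):
    """Maps image-encoder patch features into the LLM embedding space and
    reduces them to a fixed number of visual tokens."""
    def __init__(self, vision_dim: int, llm_dim: int, num_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)  # token reduction
        self.proj = nn.Linear(vision_dim, llm_dim)    # dimension alignment

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                      # (batch, num_tokens, llm_dim)

class ToyMLLM(nn.Module):
    """Prepends projected visual tokens to text embeddings and runs a small
    transformer over the combined sequence (a stand-in for the LLM; no
    causal mask here, for brevity)."""
    def __init__(self, vision_dim=1024, llm_dim=512, vocab=32000):
        super().__init__()
        self.connector = ToyConnector(vision_dim, llm_dim)
        self.tok_emb = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_feats: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.connector(patch_feats)
        txt_tokens = self.tok_emb(text_ids)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.lm_head(self.backbone(seq))

# Usage: fake ViT-style patch features plus a short text prompt.
model = ToyMLLM()
logits = model(torch.randn(1, 256, 1024), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 80, 32000]) = 64 visual + 16 text tokens
```

This sketch only illustrates where the three ablated components sit relative to each other (image encoder output, connector, language model); the paper's finding is that the encoder and image resolution matter far more than the connector design itself.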