18 Apr 2024 | Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dutter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu He, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
This paper discusses the development of Multimodal Large Language Models (MLLMs), focusing on the importance of various architectural components and data choices. The authors conduct comprehensive ablations to identify crucial design lessons, such as the impact of image resolution, the image encoder pre-training loss, and the pre-training data mixture. They demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results. The image encoder and image resolution have a significant impact, while the design of the vision-language connector is comparatively unimportant. Scaling up this recipe yields MM1, a family of MLLMs with dense variants up to 30B parameters and mixture-of-experts (MoE) variants up to 64B parameters, which achieve SOTA pre-training metrics and competitive performance on established benchmarks after supervised fine-tuning. The MM1 models exhibit appealing properties such as enhanced in-context learning and multi-image reasoning, enabling strong few-shot learning capabilities. The paper provides detailed insights into the MLLM building process, aiming to help the community develop robust models beyond any specific architecture or data strategy.
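To make the architecture discussion concrete, below is a minimal sketch of the generic MLLM layout the summary describes: an image encoder produces patch features, a vision-language connector projects them into the language model's embedding space, and the combined visual and text tokens are fed to a decoder. All module names, dimensions, and the average-pooling connector here are illustrative assumptions, not MM1's actual implementation.

```python
# Toy MLLM layout: image encoder features -> connector -> language model.
# Shapes and modules are hypothetical stand-ins, not the MM1 architecture.
import torch
import torch.nn as nn

class ToyConnector(nn.Module):
    """Maps image-encoder patch features into the LLM embedding space and
    reduces them to a fixed number of visual tokens."""
    def __init__(self, vision_dim: int, llm_dim: int, num_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)  # token reduction
        self.proj = nn.Linear(vision_dim, llm_dim)    # dimension alignment

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                      # (batch, num_tokens, llm_dim)

class ToyMLLM(nn.Module):
    """Prepends projected visual tokens to text embeddings and runs a small
    transformer over the combined sequence (a stand-in for the LLM; no
    causal mask here, for brevity)."""
    def __init__(self, vision_dim=1024, llm_dim=512, vocab=32000):
        super().__init__()
        self.connector = ToyConnector(vision_dim, llm_dim)
        self.tok_emb = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_feats: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.connector(patch_feats)
        txt_tokens = self.tok_emb(text_ids)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.lm_head(self.backbone(seq))

# Usage: fake ViT-style patch features plus a short text prompt.
model = ToyMLLM()
logits = model(torch.randn(1, 256, 1024), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 80, 32000]) = 64 visual + 16 text tokens
```

This sketch only illustrates where the three ablated components sit relative to each other (image encoder output, connector, language model); the paper's finding is that the encoder and image resolution matter far more than the connector design itself.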