What matters when building vision-language models?

3 May 2024 | Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
Building vision-language models (VLMs) involves critical design decisions that significantly affect performance. This paper investigates the key factors in VLM design through extensive experiments covering pre-trained backbones, architecture choices, data, and training methods. The study introduces Idefics2, an efficient 8-billion-parameter VLM that achieves state-of-the-art performance within its size category on various benchmarks, often matching much larger models.

The research highlights that strong pre-trained unimodal backbones are crucial for VLM performance, and that the fully autoregressive architecture outperforms the cross-attention architecture, though it requires changes to the optimization procedure for stable training. Efficient inference is achieved through techniques such as learned pooling, which reduces the number of visual tokens without compromising performance. Preserving the original aspect ratio and resolution of images during training also improves performance without increasing computational cost. Splitting images into sub-images at inference time can further boost performance, and training on diverse data sources such as PDF documents strengthens OCR and document-understanding capabilities.

Idefics2 is trained on a large-scale dataset and fine-tuned for instruction-following tasks, achieving strong performance on benchmarks such as VQA, TextVQA, and COCO. The model is further optimized for chat scenarios, balancing efficiency and performance. These findings contribute to the ongoing development of VLMs by providing insights into design choices that improve both performance and efficiency.
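
To make the learned-pooling idea concrete, below is a minimal PyTorch sketch of a perceiver-resampler-style pooling module: a small set of learned query vectors cross-attends to the vision encoder's patch embeddings, compressing an arbitrary number of visual tokens into a fixed, much smaller set before they reach the language model. The class name, layer sizes, and number of queries are illustrative assumptions, not the exact configuration used in Idefics2.

```python
# Minimal sketch (not the authors' code) of learned pooling for visual tokens:
# learned latent queries cross-attend to the vision encoder's output, so the
# number of visual tokens passed to the language model is fixed and small.
# hidden_dim, num_queries, and num_heads below are illustrative assumptions.
import torch
import torch.nn as nn


class LearnedPooling(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned latent queries; their count fixes the output token count,
        # regardless of how many patch embeddings come in.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(hidden_dim)
        self.norm_kv = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, hidden_dim) from the vision encoder
        batch = visual_tokens.size(0)
        q = self.norm_q(self.queries).expand(batch, -1, -1)
        kv = self.norm_kv(visual_tokens)
        pooled, _ = self.cross_attn(q, kv, kv)  # (batch, num_queries, hidden_dim)
        return pooled + self.mlp(pooled)


if __name__ == "__main__":
    feats = torch.randn(2, 729, 1024)          # e.g. 729 patch embeddings in
    print(LearnedPooling()(feats).shape)       # torch.Size([2, 64, 1024]) out
```

Because the output length depends only on the number of learned queries, the cost of the language-model forward pass no longer grows with image resolution.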
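The image-splitting strategy can be illustrated in the same spirit: the image is tiled into a grid of crops, and a downscaled copy of the full image, with its aspect ratio preserved, is appended so the model keeps global context. The 2x2 grid and the helper function below are hypothetical choices for illustration, not the paper's exact preprocessing.

```python
# Minimal sketch (an illustration, not Idefics2's exact pipeline) of splitting
# an image into sub-images plus a downscaled copy of the whole image.
from PIL import Image


def split_image(image: Image.Image, grid: int = 2) -> list[Image.Image]:
    """Return grid*grid crops followed by a downscaled copy of the full image."""
    width, height = image.size
    tile_w, tile_h = width // grid, height // grid
    crops = []
    for row in range(grid):
        for col in range(grid):
            box = (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
            crops.append(image.crop(box))
    # Downscale the whole image by the grid factor, preserving aspect ratio,
    # so the global layout is still visible alongside the detailed crops.
    crops.append(image.resize((tile_w, tile_h)))
    return crops


if __name__ == "__main__":
    img = Image.new("RGB", (1024, 768), color="white")
    tiles = split_image(img)
    print(len(tiles), [t.size for t in tiles])  # 5 images: 4 crops + 1 downscaled full view
```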