What matters when building vision-language models?

3 May 2024 | Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
Building vision-language models (VLMs) involves critical design decisions that significantly affect performance. This paper investigates the key factors in VLM design through extensive experiments covering pre-trained backbones, architecture choices, data, and training methods. The study introduces Idefics2, an efficient 8-billion-parameter VLM that achieves state-of-the-art performance within its size category on various benchmarks, often matching much larger models.

The research highlights that strong pre-trained unimodal backbones are crucial for VLM performance, and that the fully autoregressive architecture outperforms the cross-attention architecture, though it requires changes to the optimization procedure for stable training. Efficient inference is achieved through techniques such as learned pooling, which reduces the number of visual tokens without compromising performance. Preserving the original aspect ratio and resolution of images during training also improves performance without increasing computational cost. Splitting images into sub-images at inference time can further boost performance, and training on diverse data sources such as PDF documents strengthens OCR and document-understanding capabilities.

Idefics2 is trained on a large-scale dataset and fine-tuned for instruction-following tasks, achieving strong performance on benchmarks such as VQA, TextVQA, and COCO. The model is further optimized for chat scenarios, balancing efficiency and performance. These findings contribute to the ongoing development of VLMs by providing insights into design choices that improve both performance and efficiency.
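
To make the learned-pooling idea concrete, below is a minimal PyTorch sketch of a perceiver-resampler-style pooling module: a small set of learned query vectors cross-attends to the vision encoder's patch embeddings, compressing an arbitrary number of visual tokens into a fixed, much smaller set before they reach the language model. The class name, layer sizes, and number of queries are illustrative assumptions, not the exact configuration used in Idefics2.

```python
# Minimal sketch (not the authors' code) of learned pooling for visual tokens:
# learned latent queries cross-attend to the vision encoder's output, so the
# number of visual tokens passed to the language model is fixed and small.
# hidden_dim, num_queries, and num_heads below are illustrative assumptions.
import torch
import torch.nn as nn


class LearnedPooling(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned latent queries; their count fixes the output token count,
        # regardless of how many patch embeddings come in.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(hidden_dim)
        self.norm_kv = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, hidden_dim) from the vision encoder
        batch = visual_tokens.size(0)
        q = self.norm_q(self.queries).expand(batch, -1, -1)
        kv = self.norm_kv(visual_tokens)
        pooled, _ = self.cross_attn(q, kv, kv)  # (batch, num_queries, hidden_dim)
        return pooled + self.mlp(pooled)


if __name__ == "__main__":
    feats = torch.randn(2, 729, 1024)          # e.g. 729 patch embeddings in
    print(LearnedPooling()(feats).shape)       # torch.Size([2, 64, 1024]) out
```

Because the output length depends only on the number of learned queries, the cost of the language-model forward pass no longer grows with image resolution.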
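The image-splitting strategy can be illustrated in the same spirit: the image is tiled into a grid of crops, and a downscaled copy of the full image, with its aspect ratio preserved, is appended so the model keeps global context. The 2x2 grid and the helper function below are hypothetical choices for illustration, not the paper's exact preprocessing.

```python
# Minimal sketch (an illustration, not Idefics2's exact pipeline) of splitting
# an image into sub-images plus a downscaled copy of the whole image.
from PIL import Image


def split_image(image: Image.Image, grid: int = 2) -> list[Image.Image]:
    """Return grid*grid crops followed by a downscaled copy of the full image."""
    width, height = image.size
    tile_w, tile_h = width // grid, height // grid
    crops = []
    for row in range(grid):
        for col in range(grid):
            box = (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
            crops.append(image.crop(box))
    # Downscale the whole image by the grid factor, preserving aspect ratio,
    # so the global layout is still visible alongside the detailed crops.
    crops.append(image.resize((tile_w, tile_h)))
    return crops


if __name__ == "__main__":
    img = Image.new("RGB", (1024, 768), color="white")
    tiles = split_image(img)
    print(len(tiles), [t.size for t in tiles])  # 5 images: 4 crops + 1 downscaled full view
```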