27 May 2024 | Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz and Vikas Chandra
This paper provides an introduction to Vision-Language Models (VLMs), covering their training paradigms, evaluation methods, and practical considerations. VLMs aim to bridge the gap between vision and language, enabling applications such as image captioning and text-to-image generation. The paper categorizes VLMs into four main training paradigms: contrastive training, masking objectives, generative models, and pre-trained backbones. Each paradigm is explained with examples and recent advancements, such as CLIP, FLAVA, and MaskVLM. The paper also discusses the importance of training data, including data curation, pruning, and diversity, and provides guidance on software, hyperparameters, and model selection. Additionally, it covers techniques for improving grounding, alignment, and text-rich image understanding, as well as parameter-efficient fine-tuning. The evaluation section highlights the challenges and limitations of current benchmarks, emphasizing the need for bias and hallucination measurements. Finally, the paper explores the extension of VLMs to video data, noting the computational and temporal challenges. The goal is to provide a comprehensive guide for researchers and practitioners entering the field of VLMs.
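To make the contrastive paradigm mentioned above concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss. The function name, embedding dimension, and temperature value are illustrative assumptions for this sketch, not the exact setup used by CLIP or the paper.

```python
# Minimal sketch of a CLIP-style contrastive (InfoNCE) objective.
# The function name, embedding size, and temperature are illustrative
# assumptions; they do not reproduce any specific paper's implementation.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch_size, embed_dim) tensors produced
    by an image encoder and a text encoder, respectively.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching image/text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Usage example with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(clip_contrastive_loss(img, txt))
```

The symmetric formulation pulls matched image-text pairs together while pushing apart all other pairs in the batch, which is the core idea behind CLIP-style contrastive training; the other paradigms (masking, generative modeling, building on pre-trained backbones) swap this objective for reconstruction or next-token losses.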