27 May 2024 | Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz and Vikas Chandra
This paper provides an introduction to Vision-Language Models (VLMs), models that combine visual and linguistic understanding. VLMs have shown significant potential in applications ranging from visual assistants to generative models, yet building reliable ones remains challenging: current models struggle with spatial relationships and attribute binding, and are prone to hallucination. The paper surveys the main families of VLMs (contrastive-based, masking-based, and generative-based) and covers training methods, data curation, and evaluation techniques, highlighting how data quality, training strategy, and benchmark design each shape VLM performance. It also discusses extending VLMs to video and the challenges that entails, and concludes by emphasizing the need for responsible development of VLMs and a better understanding of the mechanisms underlying these models.
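To make the contrastive family concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE objective, the loss that contrastive-based VLMs typically optimize. This is an illustrative PyTorch reconstruction under common assumptions (paired, same-dimension image and text embeddings; a fixed temperature), not code from the paper itself, and the function name is hypothetical.

```python
# Sketch of the CLIP-style contrastive objective used by the
# "contrastive-based" VLM family. Names and defaults are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) tensors where row i of each is a
    matched image-caption pair; all other rows act as in-batch negatives.
    """
    # Normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match each image to its text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling matched pairs together while pushing apart all other pairings in the batch is what lets contrastive VLMs learn a shared image-text embedding space without explicit labels.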