Why are Visually-Grounded Language Models Bad at Image Classification?

28 May 2024 | Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
The paper investigates the performance of visually-grounded language models (VLMs) on image classification tasks and finds that they significantly underperform state-of-the-art models such as CLIP. Despite using CLIP as their vision encoder and having far more parameters, VLMs struggle with standard benchmarks such as ImageNet. The authors explore several hypotheses, covering inference algorithms, training objectives, and data processing, and conclude that the primary issue is data-related: the information needed for image classification is encoded in the VLM's latent space but can only be decoded effectively with sufficient training data. Specifically, there is a strong correlation between how frequently a class appears during training and the VLM's accuracy on that class, and training VLMs on classification datasets lets them match state-of-the-art classification models. Based on these findings, the authors propose enhancing VLMs by integrating classification-focused datasets into their training. They show that this approach improves classification performance and transfers to more advanced capabilities such as visual question answering, yielding an 11.8% improvement on the ImageWikiQA dataset. The study highlights the importance of classification data in developing advanced visual intelligence models.
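The frequency-accuracy link described above can be probed with a short analysis script. The sketch below is not the paper's code: it assumes you already have a per-class accuracy table for the VLM and a count of how often each class name occurs in its training corpus (both dictionaries here are hypothetical, with toy values) and simply computes a Spearman rank correlation between the two.

```python
# Minimal sketch (not from the paper): estimate how strongly a VLM's
# per-class accuracy tracks how often each class name appears in its
# training data. Assumed (hypothetical) inputs:
#   class_accuracy[c]  -> accuracy of the VLM on class c (0.0-1.0)
#   class_frequency[c] -> count of class c's name in the training corpus
import numpy as np
from scipy.stats import spearmanr

def frequency_accuracy_correlation(class_accuracy, class_frequency):
    """Return the Spearman rank correlation (and p-value) between
    log-scaled class frequency and per-class accuracy."""
    classes = sorted(set(class_accuracy) & set(class_frequency))
    acc = np.array([class_accuracy[c] for c in classes], dtype=float)
    # Log-scale the counts so a few very frequent classes do not dominate.
    freq = np.log1p([class_frequency[c] for c in classes])
    rho, p_value = spearmanr(freq, acc)
    return rho, p_value

if __name__ == "__main__":
    # Toy numbers for illustration only; not results from the paper.
    acc = {"goldfish": 0.9, "tench": 0.2, "beagle": 0.8, "axolotl": 0.1}
    freq = {"goldfish": 50_000, "tench": 300, "beagle": 40_000, "axolotl": 150}
    rho, p = frequency_accuracy_correlation(acc, freq)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

A rank correlation is a reasonable choice here because class-name counts in web-scale training corpora are heavily skewed; the log transform serves the same purpose if you later want a scatter plot or a linear fit.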