28 May 2024 | Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruva Ghosh, Ludwig Schmidt, Serena Yeung-Levy
Visually-grounded language models (VLMs) perform poorly on image classification compared to models like CLIP, despite having far more parameters and often using CLIP itself as their vision encoder. The primary cause is data, not architecture: the information needed for classification is encoded in the VLM's latent space, but it can only be decoded reliably when the model has seen enough relevant training data. VLM classification accuracy correlates strongly with how frequently each class appeared during training, and when trained on sufficient data, VLMs can match the accuracy of state-of-the-art classification models.

The study shows that integrating classification-focused datasets into VLM training improves not only classification performance but also general capabilities. For example, fine-tuning LLaVA1.5-7B on ImageNet classification data improves its accuracy on ImageWikiQA by 11.8%. The findings suggest that classification is foundational to more advanced visual capabilities: closing the data gap, i.e., too few classification examples and too little class diversity in typical VLM training mixtures, improves performance on both classification and general tasks.
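To make the comparison concrete, here is a minimal sketch of the two evaluation setups being contrasted: CLIP-style zero-shot classification via image-text similarity, and prompting a VLM such as LLaVA1.5 with the candidate labels. The Hugging Face model IDs, the prompt wording, and the image path are illustrative assumptions, not the paper's exact evaluation code.

```python
# Sketch: zero-shot classification with CLIP vs. prompting a VLM.
# Model IDs, prompt wording, and the image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    LlavaForConditionalGeneration, AutoProcessor,
)

image = Image.open("example.jpg")  # any test image
class_names = ["golden retriever", "tabby cat", "sports car"]

# --- CLIP: pick the class whose text embedding best matches the image ---
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_inputs = clip_proc(
    text=[f"a photo of a {c}" for c in class_names],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    logits = clip(**clip_inputs).logits_per_image  # image-text similarity scores
print("CLIP prediction:", class_names[logits.argmax().item()])

# --- VLM: ask the model to choose among the same labels in free text ---
vlm = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
vlm_proc = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
prompt = (
    "USER: <image>\nWhich of the following is shown in the image: "
    + ", ".join(class_names) + "? Answer with one label only. ASSISTANT:"
)
vlm_inputs = vlm_proc(images=image, text=prompt, return_tensors="pt").to(
    vlm.device, torch.float16
)
with torch.no_grad():
    out = vlm.generate(**vlm_inputs, max_new_tokens=10)
print("VLM prediction:", vlm_proc.decode(out[0], skip_special_tokens=True))
```

In a real evaluation loop one would parse only the generated answer span and match it against the label set; the sketch is only meant to show the structural difference between the two setups, which is where the data-driven accuracy gap the paper describes shows up.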