The paper "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs" investigates the visual capabilities of multimodal large language models (MMLMs), particularly focusing on the limitations of GPT-4V. The authors identify systematic shortcomings in MMLMs, which often struggle with simple visual questions due to inaccurate visual grounding. They introduce the Multimodal Visual Patterns (MMVP) benchmark, which consists of 150 pairs of images with 300 questions designed to probe visual details that MMLMs fail to capture. The benchmark reveals that state-of-the-art models, including GPT-4V, perform poorly on straightforward visual questions, often providing incorrect answers and hallucinated explanations.
The study further explores the visual patterns that challenge CLIP models and how CLIP's failures correlate with MLLM performance. Nine visual patterns are identified, such as orientation, counting, and viewpoint, that pose significant challenges for both CLIP and MLLMs. The authors find that scaling up CLIP models alone does not resolve these patterns, suggesting that the visual encoder itself is a bottleneck.
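The image pairs behind these findings are what the authors call CLIP-blind pairs: images whose CLIP embeddings are nearly identical even though a vision-only encoder such as DINOv2 clearly separates them. Below is a minimal sketch of that selection step, assuming the embeddings have already been computed; the thresholds and helper names are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_clip_blind_pairs(clip_embs, dino_embs, clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs (i, j) that CLIP sees as near-duplicates but a
    vision-only encoder (e.g. DINOv2) separates. The thresholds here are
    illustrative; the paper uses a similar high-CLIP / low-DINO criterion."""
    pairs = []
    n = len(clip_embs)
    for i in range(n):
        for j in range(i + 1, n):
            if (cosine(clip_embs[i], clip_embs[j]) > clip_thresh
                    and cosine(dino_embs[i], dino_embs[j]) < dino_thresh):
                pairs.append((i, j))
    return pairs
```

In the paper, pairs flagged this way are then inspected by humans and turned into the question pairs that make up MMVP.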
To address these issues, the authors propose a Mixture of Features (MoF) approach, which feeds the MLLM features from a vision-only self-supervised encoder (DINOv2) alongside CLIP features. Two variants are studied: Additive-MoF, which linearly mixes the two feature sets, and Interleaved-MoF, which interleaves tokens from both encoders while preserving their spatial order. The results show that Interleaved-MoF significantly improves visual grounding without compromising instruction-following ability, demonstrating the effectiveness of the proposed approach.
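A minimal PyTorch sketch of the Interleaved-MoF idea follows, assuming equal numbers of patch tokens from both encoders and illustrative dimensions; the adapter design, module names, and shapes are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    """Sketch of Interleaved-MoF: project CLIP and DINOv2 (SSL) visual tokens
    with separate adapters, then interleave them along the sequence dimension
    so spatial order is preserved. Dimensions are illustrative."""

    def __init__(self, clip_dim=1024, ssl_dim=1024, llm_dim=4096):
        super().__init__()
        self.clip_adapter = nn.Linear(clip_dim, llm_dim)  # CLIP -> LLM space
        self.ssl_adapter = nn.Linear(ssl_dim, llm_dim)    # DINOv2 -> LLM space

    def forward(self, clip_tokens, ssl_tokens):
        # clip_tokens, ssl_tokens: (batch, num_patches, dim), same num_patches
        c = self.clip_adapter(clip_tokens)
        s = self.ssl_adapter(ssl_tokens)
        b, n, d = c.shape
        # Stack then reshape so tokens alternate: c1, s1, c2, s2, ...
        interleaved = torch.stack([c, s], dim=2).reshape(b, 2 * n, d)
        return interleaved  # fed to the LLM in place of CLIP-only tokens

# Example: 576 patch tokens from each encoder -> 1152 interleaved tokens.
mof = InterleavedMoF()
clip_tok = torch.randn(1, 576, 1024)
dino_tok = torch.randn(1, 576, 1024)
print(mof(clip_tok, dino_tok).shape)  # torch.Size([1, 1152, 4096])
```

Keeping both token streams in spatial order is the design point that lets the model retain CLIP's language-aligned features while gaining the SSL encoder's finer visual detail.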
The paper concludes by emphasizing the need for more diverse evaluation metrics in visual representation learning to better align with current and emerging applications. It also highlights the importance of developing new visual encoders to overcome the limitations of current models.