Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

25 Apr 2024 | Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
This paper investigates the visual shortcomings of multimodal large language models (MLLMs), focusing on the limits of their visual grounding. Despite strong performance on tasks such as visual question answering, MLLMs often misread visual details, especially subtle or fine-grained ones. The study traces these failures not to language understanding or alignment, but to the visual representations the models rely on, which are predominantly derived from the CLIP model.

To study the problem systematically, the paper introduces the Multimodal Visual Patterns (MMVP) benchmark, built from CLIP-blind pairs: images that CLIP embeds as nearly identical despite clear visual differences. Each pair is turned into questions that probe whether a model actually perceives the distinguishing detail. Even advanced models such as GPT-4V struggle with these basic visual questions, frequently giving incorrect answers and hallucinated explanations.

The study then examines the relationship between the visual patterns that challenge CLIP models and those that challenge MLLMs, and finds a strong correlation, suggesting that the shortcomings of the CLIP vision encoder are inherited by the MLLMs built on top of it.
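The summary does not spell out how CLIP-blind pairs are mined. In the paper, they are images that sit close together in CLIP embedding space but far apart under a vision-only self-supervised encoder (DINOv2). The sketch below illustrates that pairing rule only; the similarity thresholds and the specific open_clip and DINOv2 checkpoints are illustrative assumptions, not a reproduction of the authors' exact pipeline.

```python
# Sketch: mining "CLIP-blind" image pairs -- images whose CLIP embeddings are
# nearly identical while a vision-only self-supervised model (DINOv2 here)
# still tells them apart. Thresholds and checkpoints are illustrative.
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP image encoder (the representation most open-source MLLMs inherit).
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
clip_model = clip_model.to(device).eval()

# DINOv2 as the vision-only reference encoder.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()
dino_preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    """Return L2-normalized CLIP and DINOv2 embeddings for a list of image paths."""
    imgs = [Image.open(p).convert("RGB") for p in paths]
    clip_batch = torch.stack([clip_preprocess(im) for im in imgs]).to(device)
    dino_batch = torch.stack([dino_preprocess(im) for im in imgs]).to(device)
    clip_feats = F.normalize(clip_model.encode_image(clip_batch), dim=-1)
    dino_feats = F.normalize(dino(dino_batch), dim=-1)
    return clip_feats, dino_feats

@torch.no_grad()
def clip_blind_pairs(paths, clip_thresh=0.95, dino_thresh=0.6):
    """Pairs that CLIP sees as near-duplicates but DINOv2 clearly separates."""
    clip_feats, dino_feats = embed(paths)
    clip_sim = clip_feats @ clip_feats.T   # cosine similarity matrices
    dino_sim = dino_feats @ dino_feats.T
    pairs = []
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((paths[i], paths[j]))
    return pairs
```

Each mined pair is then written up as a question whose answer hinges on the visual difference between the two images; in MMVP, a model is credited only when it answers both questions of a pair correctly.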
To address these issues, the paper proposes a Mixture-of-Features (MoF) approach that combines features from vision self-supervised learning with the CLIP features MLLMs already use. MoF improves the visual grounding of MLLMs without compromising their ability to follow instructions. The paper concludes that visual representation learning remains a significant open challenge and that accurate visual grounding is crucial for future multimodal systems, underscoring the need for more diverse and comprehensive evaluations of visual models that better reflect current and emerging applications.
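The MoF idea can be sketched as a small adapter module. The version below interleaves per-patch tokens from the two vision towers before they are fed to the language model, in the spirit of the paper's interleaved MoF; the dimensions, adapter depth, and module names here are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of a Mixture-of-Features (MoF) adapter: visual tokens from CLIP and
# from a self-supervised encoder (e.g. DINOv2) are each projected into the
# LLM embedding space and then interleaved. Dimensions are assumptions.
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    def __init__(self, clip_dim=1024, ssl_dim=1024, llm_dim=4096):
        super().__init__()
        # Separate adapters, one per vision encoder, projecting to LLM width.
        self.clip_proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.ssl_proj = nn.Sequential(
            nn.Linear(ssl_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, clip_tokens, ssl_tokens):
        """
        clip_tokens: (B, N, clip_dim) patch tokens from the CLIP vision tower
        ssl_tokens:  (B, N, ssl_dim)  patch tokens from the SSL vision tower
        returns:     (B, 2N, llm_dim) interleaved visual tokens for the LLM
        """
        a = self.clip_proj(clip_tokens)      # (B, N, llm_dim)
        b = self.ssl_proj(ssl_tokens)        # (B, N, llm_dim)
        B, N, D = a.shape
        mixed = torch.stack((a, b), dim=2)   # (B, N, 2, llm_dim)
        return mixed.reshape(B, 2 * N, D)    # alternate CLIP / SSL tokens

# The simpler additive variant described in the paper would instead return a
# weighted sum of the two projections, e.g. alpha * a + (1 - alpha) * b.
```

Interleaving keeps both feature streams available to the language model at every spatial position, which is how the approach adds grounding signal without discarding the CLIP features that instruction-following was trained on.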