Vision language models are blind 🕶️

26 Jul 2024 | Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
Vision language models (VLMs) still struggle with low-level vision tasks that are easy for humans, according to a new study. The research team tested four state-of-the-art VLMs (GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet) on seven simple visual tasks built from basic geometric shapes: deciding whether two lines intersect, determining whether two circles overlap, counting the circles in an Olympic-like logo, and identifying which letter in a word is circled, among others. All of these tasks require precise spatial understanding of overlapping or closely spaced shapes, and the models found them surprisingly difficult. On average, the VLMs achieved only 58.57% accuracy across the seven tasks, with Claude-3.5 Sonnet performing best at 74.94%, far below the near-100% accuracy expected of humans.

The researchers argue that VLMs are "blind" to low-level visual details because they rely on a "late fusion" approach, in which visual features are extracted from the image before the question is considered. Since the features are fixed in advance, the model cannot go back and inspect the fine details a particular question asks about (a schematic contrast is sketched at the end of this summary).

The study underscores the limitations of current VLMs on tasks that demand precise spatial reasoning, such as counting overlapping or nested shapes, reading a letter inside a red circle, or following colored paths in a simplified subway map (a minimal reconstruction of one such task is sketched below). The findings suggest that VLMs are far less visually capable than humans, and that better handling of low-level visual detail is needed before they can perform reliably on such tasks. The authors also emphasize the importance of benchmarks built around tasks that are easy for humans but challenging for current models.
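To make the benchmark setup concrete, here is a minimal Python sketch of how one such probe could be generated: it draws two random line segments, computes the ground-truth answer geometrically, and saves an image that would be paired with the question "Do the two lines intersect?". This is an illustrative reconstruction under stated assumptions (random segments, matplotlib rendering, a `sample.png` output path), not the authors' released benchmark code.

```python
# Sketch of a line-intersection probe in the spirit of the paper's benchmark.
# Hypothetical reconstruction for illustration, not the authors' code.
import random
import matplotlib.pyplot as plt

def segments_intersect(p1, p2, p3, p4):
    """Ground truth: do segments p1-p2 and p3-p4 properly intersect?
    (Collinear edge cases are ignored for brevity.)"""
    def cross(o, a, b):
        # 2D cross product of vectors (a - o) and (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1 = cross(p3, p4, p1)  # side of p1 relative to segment p3-p4
    d2 = cross(p3, p4, p2)  # side of p2 relative to segment p3-p4
    d3 = cross(p1, p2, p3)  # side of p3 relative to segment p1-p2
    d4 = cross(p1, p2, p4)  # side of p4 relative to segment p1-p2
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0))

def make_sample(path):
    """Render two random segments to an image; return the ground-truth label."""
    pts = [(random.random(), random.random()) for _ in range(4)]
    label = segments_intersect(*pts)
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot([pts[0][0], pts[1][0]], [pts[0][1], pts[1][1]], color="red")
    ax.plot([pts[2][0], pts[3][0]], [pts[2][1], pts[3][1]], color="blue")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return label  # pair with the prompt: "Do the two lines intersect? Yes or no."

if __name__ == "__main__":
    truth = make_sample("sample.png")
    print("ground truth:", "yes" if truth else "no")
```

Accuracy on such a set is then simply the fraction of the model's yes/no answers that match the stored ground-truth labels.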
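The late-fusion argument can also be summarized schematically. In the sketch below, `vision_encoder`, `conditioned_encoder`, and `llm` are hypothetical placeholders rather than any real model's API; the point is only the order of operations the study critiques.

```python
# Toy contrast between late fusion (typical of current VLMs) and a
# question-aware encoder. All callables are hypothetical placeholders.
from typing import Callable, Sequence

Features = Sequence[float]

def late_fusion_answer(image: bytes, question: str,
                       vision_encoder: Callable[[bytes], Features],
                       llm: Callable[[str, Features], str]) -> str:
    # Visual features are computed once, before the question is seen,
    # so fine details the question later asks about may already be lost.
    visual_tokens = vision_encoder(image)
    return llm(question, visual_tokens)

def question_aware_answer(image: bytes, question: str,
                          conditioned_encoder: Callable[[bytes, str], Features],
                          llm: Callable[[str, Features], str]) -> str:
    # A question-conditioned encoder could instead let the query guide
    # which low-level visual details get extracted.
    visual_tokens = conditioned_encoder(image, question)
    return llm(question, visual_tokens)
```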