Vision language models are blind 🕶️

26 Jul 2024 | Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
The paper "Vision Language Models are Blind" by Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen explores the limitations of large vision-language models (VLMs) in performing low-level vision tasks that are simple for humans. Despite their advanced capabilities in various image-text applications and high scores on vision-understanding benchmarks, VLMs struggle with tasks that require precise spatial information and recognition of geometric primitives. The authors introduce a suite of seven simple visual tasks, including identifying intersections between lines or circles, counting circles in an Olympic logo, and identifying circled letters in words. Across different image resolutions and line widths, the average accuracy of four state-of-the-art VLMs (GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet) is only 58.57%, with Sonnet-3.5 performing the best at 74.94%. The study highlights that VLMs often rely on "late fusion" mechanisms, extracting visual features before processing the textual question, which leads to difficulties in recognizing subtle spatial relationships and geometric details. The paper suggests that early fusion approaches might be more effective in improving VLMs' vision capabilities.The paper "Vision Language Models are Blind" by Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen explores the limitations of large vision-language models (VLMs) in performing low-level vision tasks that are simple for humans. Despite their advanced capabilities in various image-text applications and high scores on vision-understanding benchmarks, VLMs struggle with tasks that require precise spatial information and recognition of geometric primitives. The authors introduce a suite of seven simple visual tasks, including identifying intersections between lines or circles, counting circles in an Olympic logo, and identifying circled letters in words. Across different image resolutions and line widths, the average accuracy of four state-of-the-art VLMs (GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet) is only 58.57%, with Sonnet-3.5 performing the best at 74.94%. The study highlights that VLMs often rely on "late fusion" mechanisms, extracting visual features before processing the textual question, which leads to difficulties in recognizing subtle spatial relationships and geometric details. The paper suggests that early fusion approaches might be more effective in improving VLMs' vision capabilities.