BLINK is a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not covered by other evaluations. It comprises 14 visual perception tasks that humans can solve "within a blink" but that pose significant challenges for current multimodal LLMs. These tasks are inspired by classical computer vision problems and recast as multiple-choice questions for multimodal LLMs to answer. Humans achieve an average accuracy of 95.70%, while the best-performing models, GPT-4V and Gemini, reach only 51.26% and 45.72% accuracy, respectively, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also shows that specialist computer vision models solve these problems much better, suggesting potential pathways for future improvement. We believe BLINK will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

The benchmark contains 3,807 multiple-choice questions paired with 7,300 images, spanning tasks from low-level pattern matching to high-level visual understanding. BLINK features diverse visual prompts, evaluates a comprehensive range of visual perception abilities, and includes "visual commonsense" problems that humans can answer within seconds. It also uses interleaved image-text formats and diverse image sources, enabling a comprehensive examination of visual perception. The results show that while humans answer the questions with high accuracy, BLINK remains challenging for existing models, highlighting the significant visual perception gap between current machine learning models and humans in perceiving, processing, and understanding complex visual and textual context.
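To make the multiple-choice setup concrete, below is a minimal sketch of how a BLINK-style item could be represented and scored. The data structure, field names, and the `predict` callback are illustrative assumptions for this sketch, not the benchmark's official schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative item structure (field names are assumptions, not BLINK's official schema).
@dataclass
class BlinkItem:
    task: str                 # e.g. a perception task such as "relative depth"
    image_paths: List[str]    # one or more images; items may interleave several images with text
    question: str             # natural-language question shown to the model
    choices: List[str]        # multiple-choice options, e.g. ["(A) ...", "(B) ..."]
    answer: str               # gold option label, e.g. "(B)"

def accuracy(items: List[BlinkItem], predict: Callable[[BlinkItem], str]) -> float:
    """Fraction of items where the model's chosen option label matches the gold label."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)

# Usage: a trivial baseline that always picks the first option's label.
if __name__ == "__main__":
    demo = [
        BlinkItem(
            task="relative depth",
            image_paths=["img1.jpg"],
            question="Which marked point is closer to the camera?",
            choices=["(A) point A", "(B) point B"],
            answer="(B)",
        ),
    ]
    print(f"baseline accuracy: {accuracy(demo, lambda it: it.choices[0][:3]):.2%}")
```

Reported model and human scores in BLINK are averages of exactly this kind of per-task multiple-choice accuracy, which is why the gap between 95.70% (humans) and roughly 45-51% (GPT-4V, Gemini) is directly comparable across tasks.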