BLINK is a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not covered by other evaluations. It comprises 14 visual perception tasks that humans can solve "within a blink" but that pose significant challenges for current multimodal LLMs. These tasks are inspired by classical computer vision problems and recast as multiple-choice questions for multimodal LLMs to answer. Humans achieve an average accuracy of 95.70%, while the best-performing models, GPT-4V and Gemini, reach only 51.26% and 45.72% accuracy, respectively, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also shows that specialist computer vision models solve these problems much better, suggesting potential pathways for future improvement. We believe BLINK will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

The benchmark contains 3,807 multiple-choice questions paired with 7,300 images, spanning tasks from low-level pattern matching to high-level visual understanding. BLINK features diverse visual prompts, evaluates a comprehensive range of visual perception abilities, and includes "visual commonsense" problems that humans can answer within seconds. It also uses interleaved image-text formats and diverse image sources, enabling a comprehensive examination of visual perception. The results show that while humans answer the questions with high accuracy, BLINK remains challenging for existing models, highlighting the significant visual perception gap between current machine learning models and humans in perceiving, processing, and understanding complex visual and textual context.
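To make the multiple-choice setup concrete, below is a minimal sketch of how a BLINK-style item could be represented and scored. The data structure, field names, and the `predict` callback are illustrative assumptions for this sketch, not the benchmark's official schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative item structure (field names are assumptions, not BLINK's official schema).
@dataclass
class BlinkItem:
    task: str                 # e.g. a perception task such as "relative depth"
    image_paths: List[str]    # one or more images; items may interleave several images with text
    question: str             # natural-language question shown to the model
    choices: List[str]        # multiple-choice options, e.g. ["(A) ...", "(B) ..."]
    answer: str               # gold option label, e.g. "(B)"

def accuracy(items: List[BlinkItem], predict: Callable[[BlinkItem], str]) -> float:
    """Fraction of items where the model's chosen option label matches the gold label."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)

# Usage: a trivial baseline that always picks the first option's label.
if __name__ == "__main__":
    demo = [
        BlinkItem(
            task="relative depth",
            image_paths=["img1.jpg"],
            question="Which marked point is closer to the camera?",
            choices=["(A) point A", "(B) point B"],
            answer="(B)",
        ),
    ]
    print(f"baseline accuracy: {accuracy(demo, lambda it: it.choices[0][:3]):.2%}")
```

Reported model and human scores in BLINK are averages of exactly this kind of per-task multiple-choice accuracy, which is why the gap between 95.70% (humans) and roughly 45-51% (GPT-4V, Gemini) is directly comparable across tasks.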