SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

25 Apr 2024 | Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan
SEED-Bench-2-Plus is a comprehensive benchmark designed to evaluate the text-rich visual comprehension capabilities of Multimodal Large Language Models (MLLMs). The benchmark includes 2.3K multiple-choice questions with precise human annotations, covering three broad categories: Charts, Maps, and Webs. These categories simulate real-world text-rich environments, including various types of charts, maps, and web screenshots. The evaluation involves 34 prominent MLLMs, including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus, revealing the current limitations and strengths of these models in text-rich visual comprehension. The dataset and evaluation code are publicly available, aiming to provide valuable insights and inspire further research in this area. Key findings include the complexity of text-rich data, varying difficulty levels across different data types, and performance disparities among leading MLLMs. The benchmark serves as a valuable supplement to existing MLLM benchmarks, highlighting the need for advancements in MLLMs' proficiency in handling text-rich scenarios.
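
To make the evaluation setup concrete, below is a minimal sketch of how accuracy on such multiple-choice questions could be scored overall and per category (Charts, Maps, Webs). It assumes each model response has already been mapped to an option letter and that exact-match against the human-annotated answer is used; the record fields and the `score_predictions` helper are hypothetical illustrations, not the paper's released evaluation code.

```python
from collections import defaultdict

def score_predictions(records):
    """Compute overall and per-category accuracy for multiple-choice answers.

    Each record is a dict with keys (hypothetical schema):
      'category'   -- e.g. 'Charts', 'Maps', or 'Webs'
      'answer'     -- ground-truth option letter, e.g. 'B'
      'prediction' -- option letter produced by the MLLM under test
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category

# Example usage with toy data (not real benchmark items):
records = [
    {"category": "Charts", "answer": "A", "prediction": "A"},
    {"category": "Maps",   "answer": "C", "prediction": "B"},
    {"category": "Webs",   "answer": "D", "prediction": "D"},
]
overall, per_category = score_predictions(records)
print(f"overall accuracy: {overall:.2f}")
print(per_category)
```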