SEED-Bench-2-Plus is a benchmark designed to evaluate the performance of Multimodal Large Language Models (MLLMs) in text-rich visual comprehension. The benchmark includes 2.3K multiple-choice questions with human annotations, covering three categories: Charts, Maps, and Webs. These categories encompass a wide range of real-world text-rich scenarios, providing a comprehensive assessment of MLLMs' ability to understand and interpret complex visual data with extensive textual information.
The evaluation covers 34 prominent MLLMs, including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus, and highlights their current limitations in text-rich visual comprehension. The benchmark assesses MLLMs' ability to interpret text, understand visual content, and discern the interactions between textual and visual contexts. The dataset and evaluation code are publicly available at https://github.com/AILab-CVC/SEED-Bench.
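To make the scoring concrete, the sketch below computes per-category multiple-choice accuracy, the kind of metric reported for benchmarks of this type. The field names ("question_id", "category", "answer") and the prediction format are illustrative assumptions, not the official schema; the actual data format and evaluation code are in the repository linked above.

```python
# Minimal sketch of per-category multiple-choice accuracy scoring.
# Field names and prediction format are hypothetical, for illustration only.
from collections import defaultdict

def per_category_accuracy(samples, predictions):
    """samples: list of dicts with 'question_id', 'category', and gold 'answer' (e.g. 'A'-'D').
    predictions: dict mapping question_id -> predicted choice letter."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        cat = sample["category"]  # e.g. "Charts", "Maps", "Webs"
        total[cat] += 1
        if predictions.get(sample["question_id"]) == sample["answer"]:
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy usage example:
samples = [
    {"question_id": 1, "category": "Charts", "answer": "B"},
    {"question_id": 2, "category": "Maps", "answer": "D"},
]
predictions = {1: "B", 2: "A"}
print(per_category_accuracy(samples, predictions))  # {'Charts': 1.0, 'Maps': 0.0}
```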
The evaluation results show that GPT-4V outperforms the other models across most evaluation types. However, most MLLMs score poorly on text-rich data, with average accuracy below 40%. These results indicate that MLLMs still need substantial improvement in handling text-rich data, particularly in scenarios such as maps, which are inherently complex and multidimensional. Performance also varies significantly across data types, underscoring the need for models that are robust and adaptable across a diverse range of text-rich scenarios.
SEED-Bench-2-Plus serves as a valuable supplement to existing benchmarks, offering insights into the current state of MLLMs in text-rich visual comprehension. It aims not only to measure current MLLM performance but also to catalyze further research into enhancing MLLMs' proficiency in multimodal comprehension of text-rich scenarios.