EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models


15 Mar 2024 | Rocktim Jyoti Das, Simeon Emilov Hristov, Haonan Li, Dimitar Iliyanov Dimitrov, Ivan Koychev, Preslav Nakov
EXAMS-V is a new multi-discipline, multilingual, and multimodal exam benchmark for evaluating vision language models (VLMs). It contains 20,932 multiple-choice questions spanning 20 school subjects, covering the natural sciences, the social sciences, and other areas such as religion, fine arts, and business, with a focus on high-school-level knowledge. The questions are available in 11 languages from 7 language families and feature text, images, tables, figures, diagrams, maps, scientific symbols, and mathematical and chemistry equations. Unlike existing benchmarks, EXAMS-V is curated from school exams of countries with diverse education systems, so solving its questions requires region-specific knowledge, intricate reasoning across languages, and advanced perception combined with joint reasoning over text and visual content.

The dataset is split into training and test sets with careful attention to language and subject representation, and it includes parallel questions in multiple languages to enable cross-language comparisons. A data-quality assessment found high-quality samples across most languages. State-of-the-art LLMs and VLMs were evaluated on the benchmark; the results show that EXAMS-V is challenging even for advanced models such as GPT-4V and Gemini, with GPT-4V performing best at an overall average score of 42.78%. Performance varies across languages and subjects, with some languages proving substantially harder than others, and is further broken down by vision feature: scientific symbols, figures, graphs, and tabular data. Overall, EXAMS-V is designed as a realistic and demanding benchmark that reflects the complexity and diversity of real-world multimodal information processing.
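Because each benchmark question is presented to the model as an image containing the question text, the answer options, and any figures or tables, one simple way to run this kind of evaluation is to send every question image to a VLM and compare the returned letter against the gold answer. The sketch below is a minimal, hypothetical harness, not the paper's official evaluation protocol: the JSONL field names (image_path, answer_key), the file name, and the single-letter answer-extraction logic are illustrative assumptions, and gpt-4o is just one vision-capable model choice.

```python
"""Minimal sketch: score a VLM on image-based multiple-choice questions.

Assumes a local JSONL file whose records carry an "image_path" and the
gold "answer_key" (e.g. "B"); these field names are illustrative, not
the official EXAMS-V schema.
"""
import base64
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "The image contains a multiple-choice exam question. "
    "Answer with a single letter (A, B, C, or D) and nothing else."
)


def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask_vlm(image_path: str, model: str = "gpt-4o") -> str:
    """Send one question image to the model and return its raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
        max_tokens=5,
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def evaluate(jsonl_path: str) -> float:
    """Accuracy over a JSONL file of {image_path, answer_key} records."""
    correct = total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = ask_vlm(record["image_path"])
            correct += int(prediction.upper().startswith(record["answer_key"].upper()))
            total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Hypothetical sample file name; replace with your own export of the test split.
    print(f"Accuracy: {evaluate('exams_v_test_sample.jsonl'):.2%}")
```

In practice, per-language and per-subject breakdowns (as reported in the paper) can be obtained by grouping the same per-question correctness flags by language and subject metadata rather than averaging over the whole test set.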