**EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models**
**Date:** 15 Mar 2024
**Authors:** Rocktim Jyoti Das, Simeon Emilov Hristov, Haonan Li, Dimitar Iliyanov Dimitrov, Ivan Koychev, Preslav Nakov
**Institution:** MBZUAI, Sofia University
**Abstract:**
EXAMS-V is a new benchmark for evaluating vision language models (VLMs) across 20 school disciplines spanning the natural sciences, the social sciences, and other miscellaneous studies. The dataset comprises 20,932 multiple-choice questions in 11 languages from 7 language families and features diverse multimodal elements such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. Unlike existing benchmarks, EXAMS-V is curated from school exams across various countries, and each question is presented as a single image snapshot, so models must jointly perceive and reason over textual and visual information within one unified frame. The dataset proves challenging even for advanced models such as GPT-4V and Gemini, suggesting it will remain a demanding benchmark as VLMs improve.
**Key Contributions:**
- Introduces a novel benchmark for VLMs, requiring them to reason over a unified snapshot of text and visual content.
- Evaluates state-of-the-art large language models and VLMs on the proposed dataset.
**Dataset Overview:**
- **Size:** 20,932 questions
- **Languages:** 11 languages from 7 families
- **Subjects:** 20 subjects spanning natural sciences, social sciences, and miscellaneous studies
- **Multimodal Features:** Text, images, tables, figures, diagrams, maps, scientific symbols, and equations (a sketch of a possible record layout follows this list)
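Because each question arrives as one unified image snapshot rather than as separate text and image fields, a dataset record plausibly needs little beyond the snapshot plus metadata. The sketch below is hypothetical: the field names and path are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical EXAMS-V-style record. The full question (stem, options,
# diagrams, tables, equations) lives inside a single image, so a model
# must read and reason over the snapshot directly. Field names and the
# path are illustrative assumptions, not the released schema.
sample = {
    "image_path": "exams_v/bulgarian/physics/q_00123.png",  # unified snapshot
    "language": "Bulgarian",
    "subject": "Physics",
    "num_choices": 4,   # answer options are rendered inside the image
    "answer_key": "C",  # gold label kept as metadata for scoring
}
```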
**Evaluation:**
- **Models:** LLaVA-1.5, Qwen-VL-Chat, GPT-4V, Gemini-Pro-Vision, and GPT-4 augmented with OCR-extracted text and image captions
- **Metrics:** Accuracy over the multiple-choice answer key (a minimal scoring sketch follows this list)
- **Results:** GPT-4V achieves the highest average score at 42.78%, followed by Gemini-Pro-Vision at 31.13%. The OCR-and-caption-augmented GPT-4 setup achieves the best results in several languages.
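Since scoring reduces to exact-match accuracy against the answer key, a minimal evaluation loop could look like the following sketch. The record fields and the letter-extraction heuristic are illustrative assumptions rather than the paper's released evaluation code:

```python
import re

# Hypothetical evaluation records: each pairs a model's raw output with
# the gold multiple-choice answer key (field names are illustrative).
predictions = [
    {"model_output": "The correct answer is (B).", "gold": "B"},
    {"model_output": "C", "gold": "C"},
    {"model_output": "I am not sure.", "gold": "A"},
]

def extract_choice(output: str) -> str | None:
    """Pull the first standalone choice letter A-E from a model response."""
    match = re.search(r"\b([A-E])\b", output)
    return match.group(1) if match else None

# A response that yields no parseable letter simply counts as incorrect.
correct = sum(
    1 for rec in predictions
    if extract_choice(rec["model_output"]) == rec["gold"]
)
accuracy = correct / len(predictions)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 66.67%
```

Extracting the choice letter with a regex is a common but lossy heuristic; unparseable responses count as wrong, which matches how a strict accuracy metric treats refusals.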
**Conclusion:**
EXAMS-V is a significant benchmark for assessing the multilingual and multimodal capabilities of VLMs, requiring advanced perception and reasoning over combined textual and visual information. Future work will focus on extending the dataset with more image samples, subjects, languages, and modalities.