Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

22 Feb 2024 | Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
The MATH-Vision (MATH-V) dataset is a new benchmark for evaluating the mathematical reasoning abilities of large multimodal models (LMMs). It consists of 3,040 high-quality math problems with visual contexts, sourced from real math competitions. The problems span 16 distinct mathematical disciplines and five levels of difficulty, and were carefully curated and validated by multiple expert annotators to ensure accuracy and diversity. The benchmark includes both open-ended and multiple-choice questions, with a balanced distribution across subjects and difficulty levels. MATH-V was created to address the limitations of existing benchmarks, which often lack diversity in question types and subjects.
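Because every MATH-V problem is tagged with a subject and a difficulty level, results are naturally reported as per-category accuracies rather than a single number. The snippet below is a minimal sketch of such a breakdown; the field names (subject, level, answer, prediction) are illustrative assumptions, not the official dataset schema or evaluation script.

from collections import defaultdict

def accuracy_breakdown(records):
    """Compute overall, per-subject, and per-level accuracy.

    Each record is assumed to carry a ground-truth `answer`, a model
    `prediction`, a `subject` (one of the 16 disciplines), and a
    `level` (1-5). Field names are illustrative only.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    per_level = defaultdict(lambda: [0, 0])    # level   -> [correct, total]
    overall = [0, 0]

    for r in records:
        hit = int(str(r["prediction"]).strip() == str(r["answer"]).strip())
        for bucket in (per_subject[r["subject"]], per_level[r["level"]], overall):
            bucket[0] += hit
            bucket[1] += 1

    def pct(correct, total):
        return 100.0 * correct / total if total else 0.0

    return {
        "overall": pct(*overall),
        "by_subject": {s: pct(c, n) for s, (c, n) in per_subject.items()},
        "by_level": {l: pct(c, n) for l, (c, n) in per_level.items()},
    }

# Toy usage with made-up records (not real MATH-V items):
demo = [
    {"subject": "analytic geometry", "level": 2, "answer": "B", "prediction": "B"},
    {"subject": "arithmetic", "level": 1, "answer": "12", "prediction": "15"},
]
print(accuracy_breakdown(demo))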
The dataset provides a comprehensive set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experiments, the authors found a significant gap between current LMMs and humans on MATH-V: even the strongest models evaluated, such as GPT-4V and Gemini, fall far short, with GPT-4V reaching an overall accuracy of 22.76% compared with human performance of 75.66% on the same problems. The results indicate that current LMMs are not yet comparable to average humans at mathematical reasoning in visual contexts.

MATH-V is also compared with existing benchmarks such as MathVista and MMMU. It is more challenging than these benchmarks, as it consists of newly collected questions covering more diverse subjects and difficulty levels. Its fine-grained categorization of problems additionally enables a detailed error analysis of LMMs, highlighting the strengths and weaknesses of current models on mathematical reasoning tasks.

The authors evaluated a range of models, including large language models (LLMs) and LMMs, and found that closed-source models generally outperformed open-source ones. Chain-of-Thought prompting did not consistently improve performance across models. An error analysis of GPT-4V showed that the most common failures stem from reasoning and vision-recognition errors, indicating that the model still struggles with complex mathematical reasoning over images.

MATH-V is intended to serve as a reliable, comprehensive, and diverse benchmark for the mathematical reasoning abilities of LMMs. The dataset is available for research and evaluation, and its use is intended to facilitate further studies in this area. The authors note that it is limited to English-language problems and does not include problems in other languages or in subjects such as physics and chemistry.
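As a rough illustration of the direct-answer versus Chain-of-Thought comparison mentioned above, the sketch below formats a MATH-V-style question under the two prompting styles. The templates and helper function are hypothetical and do not reproduce the paper's exact prompts or any model API calls.

DIRECT_TEMPLATE = (
    "Answer the question based on the image.\n"
    "Question: {question}\n"
    "{options}"
    "Give only the final answer."
)

COT_TEMPLATE = (
    "Answer the question based on the image.\n"
    "Question: {question}\n"
    "{options}"
    "Think step by step, then state the final answer on the last line."
)

def build_prompt(question: str, choices=None, use_cot: bool = False) -> str:
    """Format a MATH-V-style question as a direct or Chain-of-Thought prompt.

    `choices` is an optional list of strings for multiple-choice items;
    open-ended items simply omit it. Templates are illustrative only.
    """
    options = ""
    if choices:
        letters = "ABCDE"
        options = "Choices:\n" + "\n".join(
            f"({letters[i]}) {c}" for i, c in enumerate(choices)
        ) + "\n"
    template = COT_TEMPLATE if use_cot else DIRECT_TEMPLATE
    return template.format(question=question, options=options)

# Example: a multiple-choice item rendered with Chain-of-Thought prompting.
q = "What fraction of the square is shaded?"
print(build_prompt(q, choices=["1/2", "1/3", "1/4", "2/3", "3/4"], use_cot=True))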