22 Feb 2024 | Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
The paper introduces the MATH-Vision (MATH-V) dataset, a comprehensive collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. The dataset spans 16 distinct mathematical disciplines and is graded across 5 levels of difficulty, providing a diverse set of challenges for evaluating the mathematical reasoning abilities of Large Multimodal Models (LMMs). The authors conduct extensive experiments to assess the performance of various LMMs on MATH-V, revealing a significant gap between current LMMs and human performance. The results highlight the need for further advancements in LMMs to achieve human-level performance in multimodal mathematical reasoning. The detailed categorization of problems allows for a thorough error analysis, offering valuable insights for future research. The paper also compares MATH-V with existing benchmarks, such as MathVista and MMMU, emphasizing the limitations of these datasets in terms of question diversity and subject coverage. The contributions of the study include the introduction of MATH-V, the benchmarking of LMMs, and a comprehensive error analysis, providing a foundation for future research in this field.
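Since the benchmark's value comes from its per-subject and per-difficulty categorization, a minimal sketch of how one might aggregate model predictions along those axes is shown below. The record fields (`subject`, `level`, `answer`, `prediction`) and the exact-match scoring are assumptions for illustration; the actual MATH-V schema and the paper's evaluation protocol may differ.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Compute exact-match accuracy grouped by a key such as 'subject' or 'level'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"].strip() == r["answer"].strip())
    return {group: correct[group] / total[group] for group in total}

# Hypothetical records; real MATH-V entries and field names may differ.
records = [
    {"subject": "analytic geometry", "level": 3, "answer": "B", "prediction": "B"},
    {"subject": "combinatorics", "level": 5, "answer": "12", "prediction": "7"},
]

print(accuracy_by_group(records, "subject"))  # accuracy per mathematical discipline
print(accuracy_by_group(records, "level"))    # accuracy per difficulty level
```

Grouping the same prediction set by discipline and by difficulty is what enables the kind of fine-grained error analysis the paper reports, rather than a single aggregate score.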