18 Aug 2024 | Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li
MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
This paper introduces MATHVERSE, a comprehensive and specialized benchmark for evaluating the visual mathematical reasoning capabilities of Multi-modal Large Language Models (MLLMs). The benchmark includes 2,612 high-quality math problems with diagrams, transformed into six versions with varying degrees of textual and visual information. These versions allow for a detailed assessment of whether and how much MLLMs can truly understand visual diagrams for mathematical reasoning.
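A minimal sketch of how one benchmark item might be represented across the six conditions (in the paper: Text Dominant, Text Lite, Text Only, Vision Intensive, Vision Dominant, and Vision Only). The field names and class layout below are illustrative assumptions, not the released data schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Version(Enum):
    # Paraphrased version definitions; see the paper for the exact criteria.
    TEXT_DOMINANT = "text_dominant"        # full question text plus the diagram
    TEXT_LITE = "text_lite"                # descriptive text removed, diagram kept
    TEXT_ONLY = "text_only"                # diagram removed, descriptive text kept
    VISION_INTENSIVE = "vision_intensive"  # implicit properties dropped from the text
    VISION_DOMINANT = "vision_dominant"    # essential measurements moved into the diagram
    VISION_ONLY = "vision_only"            # entire question rendered inside the image

@dataclass
class MathVerseProblem:
    """One benchmark item under one information condition.
    Field names are illustrative, not the released schema."""
    problem_id: str
    subject: str                  # e.g. plane geometry, solid geometry, functions
    version: Version
    question_text: str            # empty under the Vision Only condition
    diagram_path: Optional[str]   # None under the Text Only condition
    answer: str
```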
The paper highlights three issues overlooked by existing benchmarks: (1) whether MLLMs truly see the diagrams, (2) whether assessment should rest solely on final answers, and (3) whether the benchmarks actually measure mathematical reasoning rather than other abilities. To address the second issue, the paper proposes a Chain-of-Thought (CoT) evaluation strategy that assesses the intermediate reasoning steps of MLLMs, enabling detailed error analysis.
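The CoT evaluation runs in two stages: key-step extraction followed by per-step scoring, both delegated to an LLM judge (the paper uses GPT-4 in this role). The sketch below illustrates that two-stage flow; the prompts, the `judge` callable, and the plain averaging are assumptions for illustration, not the paper's exact prompts or scoring formula:

```python
from typing import Callable

def cot_evaluate(model_output: str, ground_truth: str,
                 judge: Callable[[str], str]) -> float:
    """Two-stage CoT scoring sketch: extract key reasoning steps,
    then score each step. `judge` is any LLM completion function;
    prompts here are illustrative, not the paper's."""
    # Stage 1: key-step extraction -- ask the judge to list the critical
    # intermediate steps in the model's solution, one per line.
    steps = judge(
        "Extract the key intermediate reasoning steps, one per line:\n"
        + model_output
    ).strip().splitlines()

    # Stage 2: multi-step scoring -- check each extracted step for
    # correctness against the reference solution (1 = correct, 0 = wrong).
    scores = []
    for step in steps:
        verdict = judge(
            f"Reference solution:\n{ground_truth}\n\n"
            f"Candidate step:\n{step}\n\n"
            "Is this step logically and numerically correct? Answer 1 or 0."
        )
        scores.append(1.0 if verdict.strip().startswith("1") else 0.0)

    # Aggregate as mean step correctness; a full implementation would also
    # weight the final answer, which this sketch omits.
    return sum(scores) / len(scores) if scores else 0.0
```

Any completion function can be passed as `judge`, e.g. a thin wrapper around an API client, which keeps the scoring logic independent of a particular model provider.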
The experiments show that most existing MLLMs rely heavily on textual information rather than visual diagrams to solve math problems; some, such as Qwen-VL-Max and InternLM-XComposer2, even achieve higher accuracy without visual input. In contrast, GPT-4V and ShareGPT4V demonstrate comparatively better visual comprehension for mathematical reasoning.
The paper also emphasizes the importance of mathematical visual interpretation for MLLMs: inadequate visual interpretation is a significant barrier to solving multi-modal math problems, and the results suggest substantial room for improvement in this area.
The contributions of this paper include the introduction of MATHVERSE, a detailed analysis of the visual mathematical reasoning capabilities of MLLMs, and the proposal of a CoT evaluation strategy for fine-grained assessment of MLLMs. The benchmark provides a comprehensive evaluation of MLLMs, highlighting the need for improved visual encoding capabilities for mathematical diagrams.