**Abstract:**
Visual mathematical reasoning, a fundamental skill, has gained significant attention in the Large Multimodal Models (LMMs) community. Existing benchmarks focus on result-oriented performance but overlook the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, WE-MATH is introduced as the first benchmark designed to explore problem-solving principles beyond end-to-end performance. It meticulously collects and categorizes 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. The benchmark introduces a novel four-dimensional metric—*Inadequate Knowledge (IK)*, *Inadequate Generalization (IG)*, *Complete Mastery (CM)*, and *Rote Memorization (RM)*—to hierarchically assess inherent issues in LMMs' reasoning processes. Evaluations reveal a negative correlation between the number of solving steps and problem-specific performance. GPT-4o shows marked progress on *IK* issues, with its remaining errors shifting toward *IG* challenges, whereas other LMMs exhibit a strong inclination toward *Rote Memorization*, solving composite problems correctly while failing the corresponding sub-problems. WE-MATH opens new pathways for advancements in visual mathematical reasoning for LMMs, with data and evaluation code available at https://github.com/We-Math/We-Math.
**Introduction:**
Visual mathematical reasoning is a critical capability for foundation models, and existing methods employ various techniques to guide LMMs toward human-like reasoning patterns. WE-MATH, a pioneering benchmark, evaluates LMMs' reasoning processes in terms of knowledge concepts by decomposing composite problems into sub-problems. It features a hierarchical knowledge structure, knowledge-based reasoning evaluation, and knowledge concept augmentation. The benchmark consists of 6.5K problems organized into 5 layers of granularity and 67 knowledge concepts. Each composite problem is decomposed into sub-problems according to the knowledge concepts it requires, and LMMs are scored with the four-dimensional metric (*IK*, *IG*, *CM*, *RM*), as sketched below. The evaluation reveals that the number of knowledge concepts involved in a problem negatively correlates with LMMs' performance, and that GPT-4o shows significant improvement on *IK* issues, with its remaining errors transitioning to *IG* challenges. Other LMMs exhibit *Rote Memorization* issues, solving composite problems correctly while failing the corresponding sub-problems. WE-MATH provides new insights and tools for advancing visual mathematical reasoning in LMMs.
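The following is a minimal sketch of how the four-dimensional metric could assign a category to a single composite problem from the correctness of the composite answer and its decomposed sub-problems. The text above only states the *RM* case explicitly (composite correct, sub-problems failed); the decision rules for *IK*, *IG*, and *CM*, and the function and class names, are assumptions for illustration rather than the benchmark's actual evaluation code.

```python
from enum import Enum

class Outcome(Enum):
    IK = "Inadequate Knowledge"       # required one-step sub-problems already fail
    IG = "Inadequate Generalization"  # sub-problems pass, composite still fails
    CM = "Complete Mastery"           # composite and all sub-problems pass
    RM = "Rote Memorization"          # composite passes despite failed sub-problems

def classify(composite_correct: bool, subproblems_correct: list[bool]) -> Outcome:
    """Hypothetical decision rule: map a model's results on a composite
    problem and its knowledge-concept sub-problems to one of four categories."""
    all_subs = all(subproblems_correct)
    if composite_correct and all_subs:
        return Outcome.CM
    if composite_correct and not all_subs:
        return Outcome.RM  # right final answer without mastering the required steps
    if all_subs:
        return Outcome.IG  # knows each step but cannot compose them
    return Outcome.IK      # missing knowledge of at least one required concept

# Example: composite solved correctly while one sub-problem fails -> Rote Memorization
print(classify(True, [True, False]).name)  # "RM"
```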