**Abstract:**
Visual mathematical reasoning, a fundamental skill, has gained significant attention in the Large Multimodal Models (LMMs) community. Existing benchmarks focus on result-oriented performance but overlook the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, WE-MATH is introduced as the first benchmark designed to explore problem-solving principles beyond end-to-end performance. It meticulously collects and categorizes 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. The benchmark introduces a novel four-dimensional metric—*Inadequate Knowledge (IK)*, *Inadequate Generalization (IG)*, *Complete Mastery (CM)*, and *Rote Memorization (RM)*—to hierarchically assess inherent issues in LMMs' reasoning processes. Evaluations reveal a negative correlation between the number of solving steps and problem-specific performance. GPT-4o shows marked progress on *IK* issues, with its remaining errors shifting toward *IG* challenges, whereas other LMMs exhibit a strong inclination toward *Rote Memorization*, solving composite problems correctly while failing the corresponding sub-problems. WE-MATH opens new pathways for advancements in visual mathematical reasoning for LMMs, with data and evaluation code available at https://github.com/We-Math/We-Math.
**Introduction:**
Visual mathematical reasoning is a critical capability for foundation models, and existing methods employ various techniques to guide LMMs toward human-like reasoning patterns. WE-MATH, a pioneering benchmark, evaluates LMMs' reasoning processes in terms of knowledge concepts by decomposing composite problems into sub-problems. It features a hierarchical knowledge structure, knowledge-based reasoning evaluation, and knowledge concept augmentation. The benchmark consists of 6.5K problems organized into 5 layers of granularity and 67 knowledge concepts. Each composite problem is decomposed into sub-problems according to the knowledge concepts it requires, and LMMs are scored with the four-dimensional metric (*IK*, *IG*, *CM*, *RM*), as sketched below. The evaluation reveals that the number of knowledge concepts involved in a problem negatively correlates with LMMs' performance, and that GPT-4o shows significant improvement on *IK* issues, with its remaining errors transitioning to *IG* challenges. Other LMMs exhibit *Rote Memorization* issues, solving composite problems correctly while failing the corresponding sub-problems. WE-MATH provides new insights and tools for advancing visual mathematical reasoning in LMMs.
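The following is a minimal sketch of how the four-dimensional metric could assign a category to a single composite problem from the correctness of the composite answer and its decomposed sub-problems. The text above only states the *RM* case explicitly (composite correct, sub-problems failed); the decision rules for *IK*, *IG*, and *CM*, and the function and class names, are assumptions for illustration rather than the benchmark's actual evaluation code.

```python
from enum import Enum

class Outcome(Enum):
    IK = "Inadequate Knowledge"       # required one-step sub-problems already fail
    IG = "Inadequate Generalization"  # sub-problems pass, composite still fails
    CM = "Complete Mastery"           # composite and all sub-problems pass
    RM = "Rote Memorization"          # composite passes despite failed sub-problems

def classify(composite_correct: bool, subproblems_correct: list[bool]) -> Outcome:
    """Hypothetical decision rule: map a model's results on a composite
    problem and its knowledge-concept sub-problems to one of four categories."""
    all_subs = all(subproblems_correct)
    if composite_correct and all_subs:
        return Outcome.CM
    if composite_correct and not all_subs:
        return Outcome.RM  # right final answer without mastering the required steps
    if all_subs:
        return Outcome.IG  # knows each step but cannot compose them
    return Outcome.IK      # missing knowledge of at least one required concept

# Example: composite solved correctly while one sub-problem fails -> Rote Memorization
print(classify(True, [True, False]).name)  # "RM"
```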