WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

1 Jul 2024 | Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaohuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
WE-MATH is a benchmark designed to evaluate the visual mathematical reasoning capabilities of large multimodal models (LMMs). It comprises 6,500 visual math problems spanning 67 knowledge concepts and five layers of knowledge granularity. Each composite problem is decomposed into sub-problems according to the knowledge concepts it requires, enabling a fine-grained analysis of an LMM's reasoning process. Building on this decomposition, WE-MATH introduces a four-dimensional metric that classifies model behavior as Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), or Rote Memorization (RM). Evaluation results reveal that LMMs commonly struggle with knowledge acquisition and generalization: GPT-4o stands out with markedly stronger knowledge generalization, while most other LMMs tend toward rote memorization, solving composite problems yet failing the corresponding sub-problems. The authors also propose a knowledge concept augmentation strategy to strengthen LMMs' reasoning; larger models generally perform better, but smaller models can reach strong performance with appropriate augmentation. By providing a structured, knowledge-based evaluation framework, WE-MATH highlights the gap between memorization and genuine generalization and aims to guide further progress in visual mathematical reasoning.
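To make the four-dimensional metric concrete, the sketch below shows one plausible way to map a model's results on a composite problem and its knowledge-concept sub-problems onto IK, IG, CM, and RM. Only the RM case (composite solved but sub-problems failed) is stated explicitly in the summary above; the other mappings, and the `classify` helper itself, are illustrative assumptions rather than the paper's exact scoring rules.

```python
from enum import Enum

class ReasoningCategory(Enum):
    IK = "Insufficient Knowledge"
    IG = "Inadequate Generalization"
    CM = "Complete Mastery"
    RM = "Rote Memorization"

def classify(sub_problems_correct: list[bool], composite_correct: bool) -> ReasoningCategory:
    """Classify one composite problem given correctness of its sub-problems.

    Assumed mapping (only RM is spelled out in the summary above):
    all correct -> CM; composite right but a sub-problem wrong -> RM;
    sub-problems right but composite wrong -> IG; otherwise -> IK.
    """
    all_subs = all(sub_problems_correct)
    if composite_correct and all_subs:
        return ReasoningCategory.CM   # masters the pieces and their composition
    if composite_correct:
        return ReasoningCategory.RM   # right answer without the underlying steps
    if all_subs:
        return ReasoningCategory.IG   # knows the pieces, fails to compose them
    return ReasoningCategory.IK       # missing the underlying knowledge

# Example: composite solved while one sub-problem fails -> Rote Memorization
print(classify([True, False], composite_correct=True).value)
```

In this reading, aggregate IK/IG/CM/RM rates over a test set would summarize whether a model's correct composite answers are backed by mastery of the prerequisite concepts or merely memorized patterns.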