GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

17 May 2024 | Jiaxin Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu, Yashar Moshfeghi
The paper introduces GeoEval, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) and multi-modal models (MMs) in solving geometry math problems. The benchmark comprises four subsets: GeoEval-2000, GeoEval-backward, GeoEval-aug, and GeoEval-hard, totaling 5,050 geometry math problems. These subsets cover a wide range of geometric shapes, problem types, and complexity levels, ensuring a thorough assessment of models' capabilities.

Key findings from the evaluation of ten LLMs and MMs across these subsets show that models pre-trained on mathematical corpora, such as the WizardMath models, perform significantly better. Specifically, WizardMath-70B achieves an accuracy of 55.67% on the main GeoEval-2000 subset but only 6.00% on the GeoEval-hard subset, highlighting how much room remains on unseen or complex problems. GPT-series models improve when they rephrase geometry questions before answering, suggesting a potential benefit of self-rephrasing in problem-solving. The study also highlights the importance of geometric diagram descriptions and of handling external constants: WizardMath-7B-V1.1 maintains consistent accuracy whether or not external constants are required, while other models struggle when they are. Performance further degrades as problems become longer and more complex.

Overall, GeoEval provides a robust framework for evaluating LLMs and MMs on geometry problem-solving, emphasizing the critical role of mathematical corpus pre-training and the value of diverse, challenging datasets.
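To make the evaluation protocol concrete, below is a minimal sketch (not the paper's code) of how one might score an LLM on GeoEval-style problems using the self-rephrasing strategy described above. The `query_llm` hook, the two-step prompts, and the `{"question": ..., "answer": ...}` record layout are illustrative assumptions, not part of the GeoEval release.

```python
# Minimal sketch, assuming a generic chat-style LLM endpoint.
from typing import Callable, Iterable

def query_llm(prompt: str) -> str:
    """Hypothetical hook: send `prompt` to your LLM and return its text reply."""
    raise NotImplementedError

def solve_with_rephrasing(question: str, ask: Callable[[str], str]) -> str:
    # Step 1: ask the model to restate the geometry problem in its own words.
    rephrased = ask(f"Rephrase this geometry problem clearly and completely:\n{question}")
    # Step 2: solve the rephrased version, requesting only the final numeric answer.
    return ask(f"Solve the following geometry problem. Give only the final numeric answer.\n{rephrased}")

def accuracy(problems: Iterable[dict], ask: Callable[[str], str], tol: float = 1e-2) -> float:
    """Fraction of problems whose predicted number matches the gold answer within `tol`."""
    correct = total = 0
    for p in problems:  # each record assumed to look like {"question": str, "answer": float}
        total += 1
        try:
            pred = float(solve_with_rephrasing(p["question"], ask))
            correct += abs(pred - float(p["answer"])) <= tol
        except (ValueError, TypeError):
            pass  # unparsable model output counts as wrong
    return correct / max(total, 1)
```

Plugging a real model into `query_llm` and running `accuracy` over each subset would reproduce the kind of per-subset accuracy figures reported in the paper (e.g., comparing GeoEval-2000 against GeoEval-hard).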