GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

17 May 2024 | Jiaxin Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu, Yashar Moshfeghi
GeoEval is a benchmark designed to evaluate the geometry problem-solving capabilities of large language models (LLMs) and multi-modal models (MMs). It comprises four subsets: GeoEval-2000 (2,000 problems), GeoEval-backward (750 problems), GeoEval-aug (2,000 problems), and GeoEval-hard (300 problems). The benchmark offers broad problem variety, dual inputs (textual descriptions and geometric diagrams), diverse challenge types, and complexity ratings, enabling a thorough assessment of models' performance on geometry math problems.

Ten LLMs and MMs were evaluated on the benchmark, including WizardMath-70B, GPT-3.5, GPT-4, and several multi-modal models. WizardMath-70B achieved the highest accuracy on the main GeoEval-2000 subset (55.67%) but struggled on the hardest subset, GeoEval-hard (6.00%). GPT-series models performed better on problems they had themselves rephrased, suggesting that rephrasing can enhance model performance. Models with access to geometric diagram descriptions also performed better, highlighting the importance of visual information in solving geometry problems.

The results further show that models trained on mathematical corpora outperform those trained on general data, particularly on geometry problems. Even these models, however, struggled with the most challenging problems, indicating the need for further improvements in reasoning capabilities. Overall, GeoEval provides a comprehensive and challenging assessment of LLMs and MMs in geometry problem-solving, offering insights into their strengths and weaknesses, and the results emphasize the importance of mathematical pre-training and the value of incorporating geometric diagram descriptions when solving geometry problems.
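To make the evaluation setup more concrete, the sketch below shows a minimal loop over the four subsets that scores a model by exact-match accuracy on final numeric answers. It is illustrative only: the file names, record fields, and the `query_model` callable are hypothetical placeholders, not the official GeoEval API or the paper's exact scoring protocol.

```python
import json
import re

# Hypothetical file layout; the official GeoEval release may organize data differently.
SUBSETS = {
    "GeoEval-2000": "geoeval_2000.jsonl",
    "GeoEval-backward": "geoeval_backward.jsonl",
    "GeoEval-aug": "geoeval_aug.jsonl",
    "GeoEval-hard": "geoeval_hard.jsonl",
}


def load_problems(path):
    """Load one subset; each record is assumed to hold a question,
    an optional diagram description, and a gold numeric answer."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def extract_number(text):
    """Take the last number in the model's response as its answer
    (a common, simple heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def evaluate(query_model, tolerance=1e-3):
    """query_model(question, diagram_text) -> response string.
    Returns per-subset accuracy as a dict."""
    results = {}
    for name, path in SUBSETS.items():
        problems = load_problems(path)
        correct = 0
        for p in problems:
            response = query_model(p["question"], p.get("diagram_description", ""))
            pred = extract_number(response)
            gold = float(p["answer"])
            if pred is not None and abs(pred - gold) <= tolerance:
                correct += 1
        results[name] = correct / len(problems)
    return results
```

Scoring by the last number in a free-form response is only a rough heuristic; the GeoEval paper's own evaluation procedure should be followed for exact answer matching, especially for multiple-choice items.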