Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

2024 | Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp
This paper investigates the ability of Large Language Models (LLMs) to answer mathematical questions, focusing on their performance on the Math Stack Exchange (MSE) platform. The study evaluates six approaches (GPT-4, ToRA, LLeMa, MAmmoTH, MABOWDOR, and Mistral 7B) under two scenarios: answer generation and question-answer comparison.

The results show that GPT-4 performs best, achieving an nDCG of 0.48 and a P@10 of 0.37 on ARQMath-3 Task 1, outperforming the other models. However, GPT-4 does not answer all questions consistently or accurately, indicating limitations in its mathematical reasoning capabilities. A case study reveals that while GPT-4 can generate relevant responses in some cases, it often fails to provide accurate answers to complex mathematical problems.

The study also highlights the challenges of evaluating LLMs on mathematical reasoning, particularly for open-ended questions. The authors make their code and findings publicly available for further research. The paper underscores the importance of continued research into improving LLMs' ability to understand and solve mathematical problems.
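For readers unfamiliar with the reported metrics, the sketch below illustrates how nDCG and P@10 are conventionally computed over a ranked list of answers with graded relevance judgements. The relevance values and the 0-3 grading scale are illustrative assumptions, not data from the paper, and the official ARQMath evaluation uses slight variants of these formulas (e.g., restricting scoring to judged documents).

```python
import math

def precision_at_k(relevances, k=10):
    """Fraction of the top-k ranked answers judged relevant (grade > 0)."""
    top_k = relevances[:k]
    return sum(1 for r in top_k if r > 0) / k

def dcg(relevances, k=None):
    """Discounted cumulative gain: each grade is discounted by log2 of its rank."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG normalised by the DCG of the ideal (relevance-sorted) ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded judgements (0-3) for one topic's ranked answer list.
judged = [3, 0, 2, 1, 0, 0, 2, 0, 0, 1]
print(f"P@10 = {precision_at_k(judged, 10):.2f}")
print(f"nDCG = {ndcg(judged):.2f}")
```

In this reading, the paper's reported scores (nDCG of 0.48, P@10 of 0.37) are such per-topic values averaged over all ARQMath-3 Task 1 topics.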