Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

2024 | Ankit Satpute†§, Noah Gießing†, André Greiner-Petter§, Moritz Schubotz†, Olaf Teschke†, Akiko Aizawa‡, Bela Gipp§
This paper investigates the capabilities of Large Language Models (LLMs) in answering mathematical questions from Math Stack Exchange (MSE). The study follows a two-step approach: first, it benchmarks several LLMs on 78 MSE questions; second, it conducts a case analysis of the best-performing LLM, GPT-4, manually evaluating the quality and accuracy of its answers. The results show that GPT-4 outperforms the other LLMs, achieving an nDCG of 0.48 and a P@10 of 0.37, and surpasses the current best approach on ARQMath-3 Task 1. However, the case analysis reveals that while GPT-4 generates relevant responses in some instances, it does not consistently answer all questions accurately, particularly more complex questions requiring specialized knowledge. The paper highlights the current limitations of LLMs in complex mathematical problem-solving and sets the stage for future research and advancements in AI-driven mathematical reasoning. The code and findings are publicly available for further research.
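For readers unfamiliar with the two reported metrics, the sketch below shows how nDCG@10 and P@10 are computed for a single ranked result list. This is a minimal illustration of the standard definitions, not the paper's evaluation code; note that ARQMath officially uses the "prime" variants (nDCG′, P′@10), which drop unjudged documents before scoring. The `run` list of graded relevance judgments is a made-up example.

```python
import math

def precision_at_k(relevances, k=10):
    """Fraction of the top-k retrieved items that are relevant.
    `relevances` is a ranked list of graded relevance scores
    (e.g., 0-3 as in ARQMath); scores > 0 count as relevant."""
    top_k = relevances[:k]
    return sum(1 for r in top_k if r > 0) / k

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k ranks."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranked run for one topic, graded 0-3.
run = [3, 0, 2, 1, 0, 0, 2, 0, 0, 1]
print(f"P@10    = {precision_at_k(run):.2f}")  # 5 of 10 relevant -> 0.50
print(f"nDCG@10 = {ndcg_at_k(run):.2f}")
```

Per-topic scores like these are then averaged over all 78 topics to produce the system-level numbers quoted above.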