26 Jun 2024 | Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou
The paper "MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data" by Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, and Kai Zou of the University of Liverpool introduces "MathOdyssey", a new dataset for evaluating the mathematical problem-solving capabilities of large language models (LLMs). The dataset spans a diverse range of problems from high-school through Olympiad and university difficulty, created by subject-matter experts to rigorously test LLMs in advanced problem-solving scenarios. By releasing the dataset as a resource to the AI community, the authors aim to advance the understanding and improvement of AI capabilities in complex mathematical problem-solving.
The MathOdyssey dataset covers a wide range of subject areas, including Algebra, Number Theory, Geometry, Combinatorics, Pre-Calculus, Linear and Abstract Algebra, Calculus and Analysis, Differential Equations, Probability, and Statistics. Each problem is accompanied by an answer and a detailed solution, enabling objective evaluation of model outputs.
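Because each problem carries a reference answer, grading can be reduced to comparing a model's final answer against the key. A minimal sketch of that idea follows; the record fields (`subject`, `problem`, `answer`, `solution`) and the normalization rule are illustrative assumptions, not the official MathOdyssey schema or grading procedure.

```python
# Hypothetical record layout -- field names are assumptions, not the
# official MathOdyssey schema.
record = {
    "subject": "Algebra",
    "problem": "Solve x^2 - 5x + 6 = 0.",
    "answer": "x = 2 or x = 3",
    "solution": "Factor: (x - 2)(x - 3) = 0, so x = 2 or x = 3.",
}

def normalize(ans: str) -> str:
    """Lightly canonicalize an answer string for exact-match grading."""
    return ans.strip().lower().replace(" ", "")

def is_correct(model_output: str, reference: str) -> bool:
    """Objective check: compare the model's final answer to the reference key."""
    return normalize(model_output) == normalize(reference)
```

In practice a benchmark harness would extract the model's final answer from its full response before this comparison; exact-match on a normalized string is only the simplest possible check.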
The paper benchmarks both open-source and closed-source models, including GPT-4 Turbo, GPT-4, GPT-3.5 Turbo, the Gemini models, Claude 3, Llama-3-70B, and DBRX-Instruct. The results show that while LLMs perform well on routine and moderately difficult tasks, they struggle with Olympiad-level problems and complex university-level questions. The performance gap between open-source and closed-source models is narrowing, but substantial challenges remain, particularly on the most demanding problems.
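The difficulty-dependent results described above amount to reporting accuracy separately per difficulty tier. The following is a minimal sketch of that aggregation; the tier labels and the sample results are made up for illustration and are not figures from the paper.

```python
from collections import defaultdict

# Hypothetical per-problem grading results: (difficulty tier, correct?) pairs.
results = [
    ("High School", True), ("High School", True),
    ("Olympiad", False), ("Olympiad", False),
    ("University", True), ("University", False),
]

def accuracy_by_level(results):
    """Compute fraction of correct answers within each difficulty tier."""
    totals = defaultdict(lambda: [0, 0])  # tier -> [correct, attempted]
    for level, correct in results:
        totals[level][1] += 1
        if correct:
            totals[level][0] += 1
    return {level: c / n for level, (c, n) in totals.items()}
```

Breaking accuracy out this way is what exposes the pattern the paper reports: strong scores on routine tiers alongside sharp drops on Olympiad-level and advanced university problems.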
The study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs and provides a comprehensive benchmark dataset for future research. The dataset, results, and code are publicly available to facilitate further exploration and improvement in AI capabilities in complex mathematical problem-solving.