Measuring Mathematical Problem Solving With the MATH Dataset

8 Nov 2021 | Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
The MATH dataset is a new benchmark for evaluating mathematical problem solving in machine learning models. It consists of 12,500 challenging competition mathematics problems, each with a full step-by-step solution, so models can learn to generate answer derivations and explanations. To improve performance on MATH, the authors also introduce a large auxiliary pretraining dataset, AMPS, which contains over 100,000 problems from Khan Academy and roughly 5 million problems generated with Mathematica scripts; it is designed to teach models the fundamentals of mathematics.

Despite these efforts, accuracy on MATH remains relatively low, even for large Transformer models, and the authors find that increasing model size and parameter count alone is not sufficient for strong mathematical reasoning. Models benefit from step-by-step solutions during training, but having them generate their own step-by-step solutions at inference time does not always improve accuracy, whereas providing partial step-by-step solutions at inference does. MATH is challenging because it requires models to understand and solve complex mathematical problems, which differs from other text-based tasks.

The authors also show that, with AMPS pretraining, a 0.1B-parameter model can perform similarly to a 13B-parameter model fine-tuned on MATH. Even so, accuracy on MATH increases only slowly with model size, suggesting that algorithmic improvements may be needed to achieve strong performance.
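To give a sense of what script-generated pretraining data can look like, the sketch below builds tiny templated calculus problems together with their symbolic ground-truth answers. It is only an illustration of the general recipe: AMPS itself is produced by Mathematica scripts spanning far more problem types, and the function and parameter choices here are hypothetical, written in Python with SymPy rather than Mathematica.

```python
import random

import sympy as sp


def generate_derivative_problem(rng: random.Random):
    """Build one templated calculus problem and its symbolic answer.

    Hypothetical illustration only: AMPS is generated by Mathematica scripts
    over many problem types; this mimics the general recipe of sampling
    parameters and computing the ground-truth answer symbolically.
    """
    x = sp.symbols("x")
    a, b, n = rng.randint(1, 9), rng.randint(1, 9), rng.randint(2, 5)
    f = a * x**n + b * x
    problem = f"Find the derivative of f(x) = {sp.latex(f)} with respect to x."
    answer = sp.latex(sp.diff(f, x))
    return problem, answer


rng = random.Random(0)
for _ in range(3):
    question, answer = generate_derivative_problem(rng)
    print(question, "->", answer)
```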
The authors also evaluate human performance: a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH is challenging for humans as well. They conclude that solving MATH will require not just larger models but also new algorithmic advancements from the broader research community.
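Since each MATH solution marks its final answer with \boxed{}, accuracy can be scored automatically by comparing a model's final boxed answer against the reference answer. The checker below is a minimal sketch of that idea; the normalization rules are deliberately simplified and are not taken from the authors' released evaluation code.

```python
from typing import Optional


def extract_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a LaTeX solution.

    Braces are walked manually because the boxed answer itself may contain
    nested braces, e.g. \\boxed{\\frac{3}{2}}.
    """
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(ch)
    return "".join(chars)


def is_correct(generated_solution: str, reference_answer: str) -> bool:
    """Exact-match check after trivial whitespace removal (a simplification)."""
    predicted = extract_boxed_answer(generated_solution)
    if predicted is None:
        return False

    def normalize(s: str) -> str:
        return s.replace(" ", "").strip()

    return normalize(predicted) == normalize(reference_answer)


# Example: a generated derivation whose final answer is boxed.
sample = r"The roots sum to $-b/a$, so the answer is $\boxed{\frac{3}{2}}$."
print(is_correct(sample, r"\frac{3}{2}"))  # True
```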