3 Feb 2025 | Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal
This paper presents a framework for generating challenging mathematics questions by combining the strengths of large language models (LLMs) and human expertise. The approach extracts core mathematical skills from an existing dataset (the MATH dataset) and uses them to generate questions that require combining two distinct skills. Because such skill combinations rarely appear together in a single problem, the generated questions demand "out-of-distribution" thinking and end up both diverse and difficult. The framework is an iterative pipeline in which LLMs generate and refine candidate questions and human annotators verify and improve them based on their expertise.
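A minimal sketch of the skill-pairing idea is shown below. The skill names, prompt wording, and helper function are illustrative assumptions, not the authors' actual prompts or implementation:

```python
import itertools
import random

# Hypothetical skill labels of the kind extracted from MATH-style problems.
skills = ["modular arithmetic", "polynomial factorization",
          "geometric probability", "telescoping sums"]

def make_generation_prompt(skill_a: str, skill_b: str) -> str:
    """Build a prompt asking an LLM for one question that genuinely needs
    BOTH skills, mirroring the paper's skill-combination idea (placeholder text)."""
    return (
        f"Write a single, self-contained competition-style math question that "
        f"cannot be solved without using both '{skill_a}' and '{skill_b}'. "
        f"State the final answer separately."
    )

# Sample distinct skill pairs so each question forces an unusual combination.
pairs = random.sample(list(itertools.combinations(skills, 2)), k=3)
for a, b in pairs:
    print(make_generation_prompt(a, b), end="\n\n")
```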
The resulting dataset, called MATH², is significantly harder than the original MATH dataset. Model performance on MATH² is observed to be approximately the square of performance on MATH, consistent with each question requiring the successful application of two distinct mathematical skills rather than one. This makes MATH² a more discriminative benchmark for evaluating mathematical reasoning capabilities.
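One heuristic reading of this scaling, consistent with the paper's framing but stated here under an explicit independence assumption: if a model applies a single required skill correctly with probability roughly equal to its MATH accuracy, and a MATH² question needs two such skills applied roughly independently, the expected MATH² accuracy is the product of the two.

```latex
% Heuristic, assuming the two required skills succeed roughly independently:
%   P(solve a MATH question)   = p        (one skill applied correctly)
%   P(solve a MATH^2 question) \approx p \cdot p
\[
  \mathrm{acc}_{\mathrm{MATH}^2} \;\approx\; \bigl(\mathrm{acc}_{\mathrm{MATH}}\bigr)^{2}
\]
```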
The pipeline involves several steps: skill extraction, question generation, solution attempts, question validation, and final solution refinement. Human annotators play a crucial role in ensuring the quality and difficulty of the generated questions. The process also includes using the generated questions as in-context examples, which significantly improves model performance on the MATH dataset.
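A hedged sketch of that generate, attempt, validate, and refine loop follows. The `llm.complete(prompt) -> str` interface and all prompt wording are placeholders; the real pipeline also routes questions to human annotators rather than relying on the model's own verdict:

```python
def generate_math2_question(llm, skill_a: str, skill_b: str, max_rounds: int = 3):
    """Illustrative generate -> attempt -> validate -> refine loop (placeholder API)."""
    question = llm.complete(
        f"Write a hard question that requires both {skill_a} and {skill_b}."
    )
    for _ in range(max_rounds):
        attempt = llm.complete(f"Solve step by step:\n{question}")
        verdict = llm.complete(
            "Does this question truly require both skills, and is the attempted "
            f"solution sound? Answer yes or no.\nQuestion: {question}\nAttempt: {attempt}"
        )
        if verdict.strip().lower().startswith("yes"):
            break
        # Otherwise ask the model to repair the question and try again.
        question = llm.complete(f"Revise this question to fix its flaws:\n{question}")
    final_solution = llm.complete(f"Write a clean final solution:\n{question}")
    return question, final_solution
```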
The study highlights the importance of human oversight in refining LLM-generated questions to ensure they are challenging, creative, and suitable for advanced mathematical problem-solving. The framework demonstrates that combining AI and human expertise can lead to the creation of high-quality, diverse, and difficult mathematics questions, which are essential for evaluating and improving the capabilities of large language models.