3 Feb 2025 | Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal
The paper presents a framework for generating challenging and diverse mathematics questions by combining large language models (LLMs) with human expertise. The authors use LLMs to extract core mathematical skills from existing datasets and then prompt them to generate novel questions whose solutions require applying multiple skills at once. The process is a multi-turn interaction between LLMs and humans: LLMs draft questions and solutions, and human annotators verify and refine them. The resulting dataset, named MATH², is significantly harder for LLMs than the original MATH dataset, as evidenced both by lower model accuracy on MATH² and by improved MATH performance when MATH² questions are used as in-context examples.
The study also reveals a quadratic relationship between model performance on MATH and on MATH²: a model's MATH² accuracy tracks the square of its MATH accuracy, suggesting that MATH² questions genuinely require the successful application of two distinct skills. The paper discusses limitations and future work, including the need for more efficient generation methods and automated validation tools to reduce costs and improve the quality of generated questions.
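The quadratic relationship can be sketched with a simple independence argument (a hypothetical model, not the paper's exact fitting procedure): if a model applies each required skill correctly with probability equal to its single-skill MATH accuracy, and the two skills are applied independently, then its expected accuracy on a two-skill MATH² question is roughly the square of its MATH accuracy.

```python
# Sketch of the two-independent-skills model behind the quadratic
# relationship between MATH and MATH^2 accuracy. This is an illustrative
# assumption, not the paper's exact analysis.

def expected_math2_accuracy(math_accuracy: float) -> float:
    """Predicted MATH^2 accuracy if solving a question requires two
    independently applied skills, each succeeding with probability
    equal to the model's MATH accuracy."""
    return math_accuracy ** 2

# A model scoring 0.8 on MATH would be predicted to score about 0.64
# on MATH^2 under this assumption.
print(expected_math2_accuracy(0.8))
```

Under this reading, a model that is strong on MATH degrades super-linearly on MATH², which matches the paper's observation that MATH² is disproportionately harder than its constituent skills taken individually.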