The paper introduces REASONEval, a new methodology for evaluating the quality of reasoning steps in mathematical tasks, focusing on both validity and redundancy. REASONEval aims to address the limitations of current evaluation methods that primarily focus on final-answer accuracy, which can mask underlying issues such as logical errors or unnecessary steps. The methodology is designed to assess the correctness and efficiency of each step in the reasoning process, ensuring that the final answer is not the sole criterion for evaluation.
The paper presents the detailed design of REASONEval, covering its task formulation, scoring scheme, and model architecture, and describes how high-quality evaluators are trained from base models with strong mathematical knowledge using specialized datasets. Experimental results show that REASONEval outperforms existing methods in detecting different types of errors and achieves state-of-the-art performance on human-labeled datasets. Additionally, the paper demonstrates the utility of REASONEval in evaluating different LLMs specialized in math and in selecting high-quality training data to improve the efficiency and quality of solutions.
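To make the scoring scheme concrete, the sketch below shows one plausible way to turn step-level label probabilities (positive, neutral/redundant, negative) into solution-level validity and redundancy scores. The class names, function names, and the aggregation rules (weakest step for validity, worst step for redundancy) are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical per-step label probabilities produced by a step-level evaluator:
# "positive" = correct and contributes to the solution,
# "neutral"  = correct but redundant,
# "negative" = incorrect.
@dataclass
class StepScores:
    p_positive: float
    p_neutral: float
    p_negative: float


def solution_scores(steps: List[StepScores]) -> dict:
    """Aggregate step-level probabilities into solution-level scores.

    One plausible scheme: a solution is only as valid as its weakest step,
    and is flagged as redundant if any single step looks redundant.
    """
    validities = [s.p_positive + s.p_neutral for s in steps]  # step is not wrong
    redundancies = [s.p_neutral for s in steps]               # step adds nothing
    return {
        "validity": round(min(validities), 3),      # weakest-link aggregation
        "redundancy": round(max(redundancies), 3),  # worst-offender aggregation
    }


# Example: a three-step solution whose second step is likely redundant.
steps = [
    StepScores(0.92, 0.05, 0.03),
    StepScores(0.10, 0.85, 0.05),
    StepScores(0.90, 0.06, 0.04),
]
print(solution_scores(steps))  # {'validity': 0.95, 'redundancy': 0.85}
```

Under this kind of scheme, a single invalid step drags the whole solution's validity down even when the final answer happens to be correct, which is exactly the failure mode that final-answer accuracy hides.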
Key findings include:
1. An improvement in final-answer accuracy does not necessarily translate into an improvement in the overall quality of reasoning steps.
2. Model scale, base model, and training methods significantly influence the quality of reasoning steps.
3. REASONEval can effectively select high-quality training data to enhance problem-solving efficiency and solution quality (see the sketch after this list).
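To illustrate the third finding, the following hedged sketch filters candidate training solutions by their validity and redundancy scores, keeping only those that are both valid and concise. The thresholds and field names are assumptions for illustration, not values reported in the paper.

```python
from typing import Dict, List


def select_training_data(
    scored_solutions: List[Dict],
    min_validity: float = 0.9,    # assumed threshold, tune per dataset
    max_redundancy: float = 0.2,  # assumed threshold, tune per dataset
) -> List[Dict]:
    """Keep (problem, solution) pairs whose step-quality scores indicate the
    solution is both valid and concise; everything else is filtered out."""
    return [
        ex for ex in scored_solutions
        if ex["validity"] >= min_validity and ex["redundancy"] <= max_redundancy
    ]


# Usage with hypothetical pre-scored candidates:
pool = [
    {"id": 1, "validity": 0.97, "redundancy": 0.05},  # kept
    {"id": 2, "validity": 0.95, "redundancy": 0.40},  # dropped: too redundant
    {"id": 3, "validity": 0.60, "redundancy": 0.10},  # dropped: likely invalid step
]
print([ex["id"] for ex in select_training_data(pool)])  # [1]
```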
The paper concludes by highlighting the contributions of REASONEval and its potential for future research in evaluating and improving mathematical reasoning in large language models.