Evaluating Mathematical Reasoning Beyond Accuracy


8 Apr 2024 | Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu
The paper introduces REASONEVAL, a new methodology for evaluating the quality of reasoning steps produced by large language models (LLMs) on mathematical tasks. Unlike previous methods that focus solely on the final answer, REASONEVAL assesses both the validity (correctness) and the redundancy (usefulness) of each step in the reasoning process. This helps identify logical errors, unnecessary steps, and inefficiencies that degrade the overall quality of a solution, even when the final answer is correct.

REASONEVAL is implemented as a set of metrics computed by LLM-based evaluators that automatically score the reasoning steps. Tested on human-labeled datasets, it achieves state-of-the-art performance in detecting different types of errors. The results demonstrate that an increase in final-answer accuracy does not necessarily lead to an improvement in the quality of the reasoning steps.

The paper also examines how factors such as model size, base model, and training method affect the quality of reasoning steps, showing that model scale and continued pretraining on math-related data are important for improving error detection. The study further emphasizes the importance of open-source, replicable evaluation metrics for transparency and reliability in assessing mathematical reasoning. Finally, REASONEVAL proves effective at selecting high-quality training data, which can improve the efficiency and quality of LLM problem-solving on mathematical reasoning tasks.
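As a rough illustration of step-level evaluation, the sketch below shows one way per-step scores could be aggregated into solution-level metrics. This is not the paper's released code; the `(validity, redundancy)` pairs and the min/max aggregation are assumptions made here to illustrate the idea that a single invalid step undermines a whole solution, while any redundant step reduces its conciseness.

```python
# Illustrative sketch (not REASONEVAL's actual implementation):
# aggregate hypothetical per-step (validity, redundancy) scores,
# each in [0, 1], into solution-level metrics.

def solution_scores(step_scores):
    """step_scores: list of (validity, redundancy) pairs in [0, 1]."""
    validities = [v for v, _ in step_scores]
    redundancies = [r for _, r in step_scores]
    # A solution is only as valid as its weakest step.
    solution_validity = min(validities)
    # A solution is penalized by its most redundant step.
    solution_redundancy = max(redundancies)
    return solution_validity, solution_redundancy

# Example: three steps; the second is somewhat redundant,
# the third contains a likely logical error.
steps = [(0.95, 0.05), (0.90, 0.60), (0.30, 0.10)]
print(solution_scores(steps))  # (0.3, 0.6)
```

Under this toy aggregation, a chain of mostly sound steps with one flawed step still receives a low validity score, which is exactly the kind of quality signal that final-answer accuracy alone misses.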
Overall, REASONEVAL provides a new framework for evaluating the quality of reasoning steps in LLMs, which can help improve the accuracy and efficiency of mathematical problem-solving.