MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models


20 Jun 2024 | Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
MR-BEN is a comprehensive benchmark for evaluating the reasoning capabilities of large language models (LLMs). Rather than scoring only final answers, it requires models to identify and analyze errors in automatically generated reasoning steps, offering a more thorough assessment than existing outcome-based benchmarks. The dataset comprises 5,975 questions spanning subjects such as the natural sciences, coding, and logic, each paired with a chain-of-thought answer and an error analysis, so that both answer correctness and the ability to detect and correct reasoning errors can be evaluated.

The benchmark adopts a meta-reasoning paradigm in which LLMs act as teachers, judging whether a given reasoning process is correct and pinpointing potential errors. Performance is measured with metrics including the Matthews Correlation Coefficient (MCC) on solution-correctness judgments, accuracy in locating the first erroneous step, and accuracy in identifying the reason for the error.

The results show that models performing well on outcome-based benchmarks often struggle to identify and correct reasoning errors, underscoring the importance of evaluating the reasoning process itself rather than only the final answer. Experiments on model size indicate that larger models generally perform better, while techniques such as knowledge distillation can also yield substantial gains. Moreover, different models excel in different reasoning paradigms, challenging the assumption that domain-specific improvements necessarily translate into broad cognitive improvements. Taken together, MR-BEN offers a comprehensive assessment of LLMs' reasoning abilities and highlights the need for further research to strengthen them. The dataset and code are publicly available.
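To make the scoring concrete, below is a minimal sketch of how the three reported metrics could be computed from per-question judgments. The record fields (pred_correct, gold_correct, pred_step, gold_step, reason_match) and the convention of crediting the error reason only when the first error step is also located correctly are illustrative assumptions, not the paper's official evaluation code.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for binary solution-correctness judgments."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def score(records):
    """records: list of dicts with hypothetical fields:
       pred_correct / gold_correct -- model vs. annotator verdict on the solution
       pred_step / gold_step       -- index of the first erroneous step (None if correct)
       reason_match                -- whether the model's error explanation matches the annotation
    """
    tp = tn = fp = fn = 0
    step_hits = reason_hits = n_incorrect = 0
    for r in records:
        if r["pred_correct"] and r["gold_correct"]:
            tp += 1
        elif not r["pred_correct"] and not r["gold_correct"]:
            tn += 1
        elif r["pred_correct"] and not r["gold_correct"]:
            fp += 1
        else:
            fn += 1
        # Step and reason accuracy are only defined on solutions annotated as incorrect.
        if not r["gold_correct"]:
            n_incorrect += 1
            if r["pred_step"] == r["gold_step"]:
                step_hits += 1
                if r["reason_match"]:  # assumed convention: reason credited only if step is right
                    reason_hits += 1
    return {
        "mcc": mcc(tp, tn, fp, fn),
        "first_error_step_acc": step_hits / n_incorrect if n_incorrect else 0.0,
        "error_reason_acc": reason_hits / n_incorrect if n_incorrect else 0.0,
    }

if __name__ == "__main__":
    demo = [
        {"pred_correct": False, "gold_correct": False,
         "pred_step": 3, "gold_step": 3, "reason_match": True},
        {"pred_correct": True, "gold_correct": True,
         "pred_step": None, "gold_step": None, "reason_match": False},
    ]
    print(score(demo))
```

MCC is used here because correct and incorrect solutions are typically imbalanced, and it penalizes a judge that simply labels every solution as correct, which plain accuracy would reward.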