MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models

20 Jun 2024 | Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
The paper introduces MR-BEN, a comprehensive benchmark for evaluating the meta-reasoning capabilities of large language models (LLMs). MR-BEN addresses the limitations of existing outcome-based benchmarks by evaluating the reasoning process rather than only the final answer. The benchmark comprises 5,975 questions spanning subjects such as physics, chemistry, logic, and coding, and requires LLMs to locate and analyze potential errors in given reasoning steps. Its evaluation metrics assess the quality of the reasoning process along three dimensions: solution correctness, identification of the first erroneous step, and accuracy of the stated error reason. The paper reports that while many LLMs can generate correct answers, they often struggle to identify and correct errors in reasoning. The analysis reveals distinct limitations and weaknesses across different models and highlights techniques such as high-quality synthetic data as a way to improve performance. The benchmark also shows that different models excel in different reasoning paradigms, challenging the assumption that domain-specific enhancements lead to broad improvements. The paper concludes by discussing the potential societal impacts of MR-BEN and suggesting future research directions for improving LLMs' reasoning capabilities.
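To make the three evaluation dimensions concrete, the following is a minimal, hypothetical sketch of per-item scoring, not the paper's released code. It assumes each benchmark item carries a gold label for overall solution correctness, the index of the first erroneous step (if any), and a short gold explanation of the error; the field names and the `reason_matches` judgment are illustrative assumptions.

```python
# Illustrative sketch (assumed structure, not the official MR-BEN implementation):
# score one item on the three metrics described above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoldAnnotation:
    solution_is_correct: bool          # is the candidate solution correct overall?
    first_error_step: Optional[int]    # None if the solution contains no error
    error_reason: Optional[str]        # human-written reason for the first error


@dataclass
class ModelJudgment:
    solution_is_correct: bool
    first_error_step: Optional[int]
    error_reason: Optional[str]


def score_item(gold: GoldAnnotation, pred: ModelJudgment,
               reason_matches: bool) -> dict:
    """Return per-item scores for the three metrics.

    `reason_matches` stands in for a human or LLM judge deciding whether the
    predicted error reason agrees with the gold reason; how that judgment is
    produced is an assumption of this sketch.
    """
    # 1) Solution correctness: did the model judge the solution's validity correctly?
    correctness = pred.solution_is_correct == gold.solution_is_correct

    # 2) First-error-step accuracy: only applies to solutions annotated as wrong.
    step_acc = (
        gold.first_error_step is not None
        and pred.first_error_step == gold.first_error_step
    )

    # 3) Error-reason accuracy: the step must be right and the reason must match.
    reason_acc = step_acc and reason_matches

    return {
        "solution_correctness": float(correctness),
        "first_error_step": float(step_acc),
        "error_reason": float(reason_acc),
    }
```

Averaging these per-item scores over the benchmark would yield the kind of process-level accuracies the paper reports, in contrast to outcome-only accuracy on final answers.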