1 Jun 2024 | Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang
CRITICBENCH is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to critique and correct their reasoning across a variety of tasks. It compiles 15 datasets spanning five domains: mathematical, commonsense, symbolic, coding, and algorithmic reasoning. Responses are collected from three LLM families, and 17 LLMs are evaluated, including GPT-3.5, GPT-4, Phi-2, LLaMA, Vicuna, and Mistral. The benchmark assesses generation, critique, and correction reasoning (GQC) and reveals four key findings: (1) GQC capabilities are linearly related, and critique-focused training improves performance; (2) critique and correction effectiveness varies by task, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies decrease as model size increases; and (4) an intriguing inter-model critiquing pattern emerges, in which stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in self-critique. These results underscore the importance of systematically evaluating LLMs' critique and correction abilities and point to the need for further research into LLM critique and self-improvement. Overall, CRITICBENCH provides a comprehensive, comparative analysis of LLMs' generation, critique, and correction capabilities, offering insights into their nuanced critique-correct reasoning.
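The GQC protocol described above amounts to querying a model three times per problem: once to generate an answer, once to critique a given candidate answer, and once to produce a correction informed by that critique. The sketch below is a minimal, hypothetical illustration of such a loop in Python; `query_model`, the prompt templates, and the `GQCResult` container are assumptions made for illustration and are not taken from CRITICBENCH itself.

```python
# Hypothetical sketch of a generation-critique-correction (GQC) evaluation loop.
# query_model is a stand-in for whatever LLM client the reader uses; the prompts
# are illustrative and do not reproduce CRITICBENCH's actual templates.

from dataclasses import dataclass


@dataclass
class GQCResult:
    generation: str
    critique: str
    correction: str


def query_model(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real model client."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def run_gqc(question: str, candidate_answer: str) -> GQCResult:
    # 1) Generation: the model answers the question from scratch.
    generation = query_model(f"Question: {question}\nAnswer step by step.")

    # 2) Critique: the model judges whether a candidate answer is correct.
    critique = query_model(
        f"Question: {question}\nCandidate answer: {candidate_answer}\n"
        "Is this answer correct? Explain any error you find."
    )

    # 3) Correction: the model revises the candidate answer using its critique.
    correction = query_model(
        f"Question: {question}\nCandidate answer: {candidate_answer}\n"
        f"Critique: {critique}\nGive a corrected final answer."
    )

    return GQCResult(generation, critique, correction)


if __name__ == "__main__":
    result = run_gqc("What is 17 * 24?", "The answer is 398.")
    print(result.critique)
```

In a setup like this, the candidate answers fed into the critique and correction stages could come from other models (inter-model critique) or from the same model's own generations (self-critique), which is the distinction behind the paper's fourth finding.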