CRITICBENCH: Benchmarking LLMs for Critique-Correct Reasoning

1 Jun 2024 | Zicheng Lin*, Zhibin Gou*, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang†
The paper introduces CRITICBENCH, a comprehensive benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate, critique, and correct their responses across five domains: mathematical, commonsense, symbolic, coding, and algorithmic reasoning. The benchmark comprises 15 datasets and incorporates responses from three LLM families. The study evaluates 17 LLMs, including closed-source models such as GPT-3.5 and GPT-4, open-source models, and models specifically trained for critiquing. Key findings include:

1. **Linear Relationship**: LLMs exhibit a linear relationship among their generation, critique, and correction (GQC) capabilities, and critique-focused training significantly enhances performance.
2. **Task Dependency**: The effectiveness of critique and correction varies by task type, with logic-oriented tasks being more amenable to correction.
3. **Knowledge Inconsistencies**: Inconsistencies among GQC knowledge decrease as model size increases.
4. **Inter-Model Critiquing**: Stronger models are better at critiquing weaker models, while weaker models can surprisingly surpass stronger ones in self-critique.

The paper aims to foster further research in LLM critique and self-improvement by providing insights into the nuanced critique-correct reasoning of LLMs.
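To make the critique phase of such an evaluation concrete, the sketch below shows a minimal way one might score a model's verdicts on candidate answers against gold correctness labels. This is a hypothetical illustration under assumed names (`Sample`, `critique_accuracy`, `toy_judge`), not the official CRITICBENCH API or data format.

```python
# Hypothetical sketch of scoring the "critique" step of a GQC-style
# evaluation: a judge labels each candidate response correct/incorrect,
# and we measure agreement with gold labels. Names are illustrative.
from dataclasses import dataclass


@dataclass
class Sample:
    question: str      # the task prompt
    response: str      # a model-generated answer to be critiqued
    is_correct: bool   # gold label for that answer


def critique_accuracy(samples, judge):
    """Fraction of samples where the judge's correct/incorrect verdict
    matches the gold label."""
    hits = sum(judge(s.question, s.response) == s.is_correct
               for s in samples)
    return hits / len(samples)


# Toy stand-in judge for demonstration: treat any response containing
# "unsure" as incorrect. A real setup would query an LLM here.
def toy_judge(question, response):
    return "unsure" not in response


data = [
    Sample("What is 2 + 2?", "4", True),
    Sample("What is 2 + 2?", "unsure, maybe 5", False),
]
print(critique_accuracy(data, toy_judge))  # → 1.0
```

In a real benchmark run, `judge` would wrap an LLM call, and the same loop could be repeated per domain to compare critique ability across task types.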