The paper introduces CRITICBENCH, a comprehensive benchmark for evaluating the critique-correct reasoning abilities of Large Language Models (LLMs), i.e., their capacity to generate, critique, and correct responses across five domains: mathematical, commonsense, symbolic, coding, and algorithmic. The benchmark comprises 15 datasets and incorporates responses from three LLM families. The study evaluates 17 LLMs, including closed-source models such as GPT-3.5 and GPT-4, open-source models, and models specifically trained for critiquing. Key findings include:
1. **Linear Relationship**: LLMs exhibit a linear relationship among their generation, critique, and correction (GQC) capabilities, and critique-focused training significantly enhances performance.
2. **Task Dependency**: The effectiveness of critique and correction varies by task type, with logic-oriented tasks being more amenable to correction.
3. **Knowledge Inconsistencies**: GQC knowledge inconsistencies decrease as model size increases.
4. **Inter-Model Critiquing**: Stronger models are better at critiquing weaker models, while weaker models can surprisingly surpass stronger ones in self-critique.
The paper aims to foster further research in LLM critique and self-improvement by providing insights into the nuanced critique-correct reasoning of LLMs.