On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks


3 Aug 2024 | Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati
This paper investigates the self-verification capabilities of Large Language Models (LLMs) on reasoning and planning tasks. Despite initial optimism, there is growing evidence that LLMs struggle with these tasks, both in generating correct solutions and in critiquing their own answers. The authors conduct a systematic empirical study with GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. In each domain they compare an iterative setup in which the LLM critiques its own candidate solutions against one in which feedback comes from an external sound verifier. The results show that self-critique significantly degrades performance, while external sound verification improves it. The study also finds that the feedback the LLM produces as a critic is often wrong or incomplete, leading to compounding errors across iterations. The authors conclude that future systems should rely on external sound verifiers rather than opaque self-critique mechanisms. These findings contradict earlier optimism about LLM self-critique and highlight the need for more robust and reliable verification methods.
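The sketch below (not the authors' code) illustrates the kind of iterative critique loop the paper studies, instantiated for Graph Coloring: the model proposes a candidate, a critic checks it, and any feedback is appended to the next prompt. The two critics being compared are the LLM itself and an external sound verifier. The functions `propose` and `self_critique` are hypothetical placeholders for a real GPT-4 client; only the constraint checker is fully implemented.

```python
"""Minimal sketch of an iterative critique ("backprompting") loop for Graph Coloring.
The LLM calls are hypothetical stubs; the sound verifier is an ordinary checker."""
from typing import Dict, List, Optional, Tuple

Graph = List[Tuple[int, int]]   # undirected edge list
Coloring = Dict[int, int]       # vertex -> color index


def sound_verifier(graph: Graph, coloring: Coloring, k: int) -> List[str]:
    """External sound critic: exhaustively lists every constraint violation."""
    errors = []
    vertices = {v for edge in graph for v in edge}
    for v in vertices:
        if v not in coloring:
            errors.append(f"vertex {v} is uncolored")
        elif not 0 <= coloring[v] < k:
            errors.append(f"vertex {v} uses out-of-range color {coloring[v]}")
    for u, v in graph:
        if u in coloring and v in coloring and coloring[u] == coloring[v]:
            errors.append(f"edge ({u}, {v}) has both endpoints colored {coloring[u]}")
    return errors


def propose(prompt: str) -> Coloring:
    """Placeholder: ask the LLM for a candidate coloring and parse it."""
    raise NotImplementedError("plug in a real LLM client here")


def self_critique(prompt: str) -> str:
    """Placeholder: ask the same LLM to critique a candidate."""
    raise NotImplementedError("plug in a real LLM client here")


def backprompt_loop(graph: Graph, k: int, rounds: int = 15,
                    llm_as_critic: bool = False) -> Optional[Coloring]:
    prompt = f"Color this graph with {k} colors. Edges: {graph}"
    for _ in range(rounds):
        candidate = propose(prompt)
        if llm_as_critic:
            # Self-critique: feedback may be wrong, so the loop can accept an
            # invalid answer or reject a valid one (the failure mode reported).
            feedback = self_critique(f"Is {candidate} a valid {k}-coloring of {graph}?")
            if "valid" in feedback.lower():
                return candidate
        else:
            # Sound external verification: acceptance guarantees correctness.
            errors = sound_verifier(graph, candidate, k)
            if not errors:
                return candidate
            feedback = "; ".join(errors)
        prompt += f"\nYour previous answer was incorrect: {feedback}\nTry again."
    return None
```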