On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

3 Aug 2024 | Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati
This paper investigates the effectiveness of iterative prompting for reasoning and planning tasks with Large Language Models (LLMs), focusing on GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. The study asks whether LLMs can self-critique and improve their own solutions through iterative prompting, and whether external verification improves performance instead. The results show that self-critique often degrades performance, while external verification significantly improves accuracy. The study also finds that simply re-prompting with a sound external verifier retains most of the benefits of more complex feedback setups.

The paper highlights the limitations of LLM self-critique: when the LLM acts as its own verifier, its false negative rate is significant, which undermines the iterative loop. By contrast, a system using an external sound verifier yields substantial performance gains. The authors conclude that LLMs are not reliable self-critics and that external verification is more effective, suggesting that future LLM applications for reasoning tasks should rely on external verification rather than self-critique. The paper also emphasizes the importance of formal verification in testing LLM reasoning capabilities and the need for more rigorous benchmarks to assess LLM performance accurately.
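To make the contrast concrete, the sketch below illustrates the kind of generate-verify-re-prompt loop the paper evaluates, using Graph Coloring as the example domain. It is a minimal illustration, not the authors' code: the `verify_coloring` checker, the `propose` callable standing in for an LLM call, and the round limit are all assumptions made for this sketch. The key point it captures is that feedback comes from a sound external verifier rather than from the model critiquing itself.

```python
# Minimal sketch (assumed, not the authors' implementation) of re-prompting
# with an external sound verifier, illustrated on Graph Coloring.
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[str, str]
Coloring = Dict[str, str]


def verify_coloring(edges: List[Edge], k: int, coloring: Coloring) -> Tuple[bool, str]:
    """Sound verifier: accepts a candidate iff it is a valid k-coloring."""
    if len(set(coloring.values())) > k:
        return False, f"Used {len(set(coloring.values()))} colors but only {k} are allowed."
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False, f"Vertex {u if u not in coloring else v} is uncolored."
        if coloring[u] == coloring[v]:
            return False, f"Adjacent vertices {u} and {v} share color {coloring[u]}."
    return True, "Valid coloring."


def reprompt_loop(
    propose: Callable[[str], Coloring],  # stand-in for an LLM call: prompt -> candidate
    edges: List[Edge],
    k: int,
    max_rounds: int = 15,  # assumed round limit for this sketch
) -> Optional[Coloring]:
    """Generate, verify with the external checker, and re-prompt with its feedback."""
    prompt = f"Color the graph {edges} with at most {k} colors."
    for _ in range(max_rounds):
        candidate = propose(prompt)
        ok, feedback = verify_coloring(edges, k, candidate)
        if ok:
            return candidate
        prompt = f"{prompt}\nPrevious attempt {candidate} was invalid: {feedback} Try again."
    return None


if __name__ == "__main__":
    # Toy stand-in for the LLM: first returns an invalid coloring, then a valid one.
    attempts = iter([{"A": "red", "B": "red", "C": "blue"},
                     {"A": "red", "B": "blue", "C": "red"}])
    result = reprompt_loop(lambda _prompt: next(attempts),
                           edges=[("A", "B"), ("B", "C")], k=2)
    print(result)  # -> the second, valid coloring
```

In the self-critique condition the paper studies, the verification step above would instead be performed by the LLM itself; the reported false negatives in that step are what cause the iterative loop to degrade rather than improve solutions.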