Chain of Thoughtlessness? An Analysis of CoT in Planning


6 Jun 2024 | Kaya Stechly*, Karthik Valmeekam*, Subbarao Kambhampati
This paper investigates the effectiveness of Chain of Thought (CoT) prompts in improving the performance of Large Language Models (LLMs) on classical planning problems, specifically in the domain of Blocksworld. The authors examine two state-of-the-art LLMs—GPT-4 and Claude-3-Opus—using various levels of specificity in CoT prompts and assess their performance across different problem complexities. The study finds that while CoT prompts can improve performance on very specific problem instances, the improvements are limited and do not generalize well to more complex or diverse problems. The results suggest that LLMs do not learn general algorithmic procedures through CoT prompts but rather rely on pattern matching and specific problem knowledge. The authors also extend their findings to scalable synthetic benchmarks, confirming similar limitations in generalization. Overall, the paper highlights the need for more rigorous evaluation methods and cautions against overestimating the generalizability of CoT prompts.
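To make the evaluation setup concrete, here is a minimal Python sketch, assuming a list-of-stacks state encoding: it assembles a one-shot CoT prompt for a Blocksworld instance and mechanically validates a candidate plan, which is how plan correctness can be checked independently of whatever reasoning text the model produces. The function names, state encoding, and prompt wording are illustrative assumptions, not the paper's actual prompts or evaluation harness.

```python
# Illustrative sketch only (not the authors' setup): one-shot CoT prompt
# construction plus a simple Blocksworld plan validator.
import copy

def make_cot_prompt(initial, goal, exemplar_reasoning):
    """Assemble a one-shot chain-of-thought prompt: a worked example with
    step-by-step reasoning, followed by the query instance."""
    return (
        "Solve the Blocksworld problem. Think step by step.\n\n"
        f"Example reasoning:\n{exemplar_reasoning}\n\n"
        f"Initial state: {initial}\n"
        f"Goal state: {goal}\n"
        "Plan:"
    )

def apply_move(stacks, block, dest):
    """Move `block` (which must be clear, i.e. on top of a stack) onto
    `dest` (a clear block or 'table'). Return the new state, or None if
    the move is illegal."""
    src = next((s for s in stacks if s and s[-1] == block), None)
    if src is None:
        return None                     # block is not clear
    src.pop()
    if dest == "table":
        stacks.append([block])
    else:
        tgt = next((s for s in stacks if s and s[-1] == dest), None)
        if tgt is None:
            src.append(block)           # undo: destination block not clear
            return None
        tgt.append(block)
    return [s for s in stacks if s]     # drop emptied stacks

def validate_plan(initial, goal, plan):
    """Simulate a candidate plan (a list of (block, destination) moves)
    and check that it legally reaches the goal configuration."""
    state = copy.deepcopy(initial)
    for block, dest in plan:
        state = apply_move(state, block, dest)
        if state is None:
            return False
    return sorted(map(tuple, state)) == sorted(map(tuple, goal))

if __name__ == "__main__":
    # Stacks are listed bottom-to-top: A sits on C; B is on the table.
    initial = [["C", "A"], ["B"]]
    goal = [["C"], ["B", "A"]]          # goal: A on B; C on the table
    print(make_cot_prompt(initial, goal,
                          "A must end up on B. A is on C and clear, "
                          "so move A directly onto B."))
    print(validate_plan(initial, goal, [("A", "B")]))  # True
```

A validator of this kind scales mechanically with the number of blocks, which is what lets this style of study measure whether CoT-prompted plans remain correct as instances grow, rather than relying on the plausibility of the model's reasoning text.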