6 Jun 2024 | Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati
This paper analyzes the effectiveness of chain of thought (CoT) prompting in large language models (LLMs) for solving planning problems, particularly in the Blocksworld domain. The study challenges the assumption that CoT improves LLM performance by teaching general reasoning algorithms. Instead, it shows that CoT's effectiveness depends on the specificity of the prompts and the similarity between the examples and the query. When prompts are overly general or not closely aligned with the problem, performance drops significantly. The results suggest that CoT does not enable LLMs to learn generalizable reasoning procedures but rather relies on pattern matching based on the specific examples provided.
The study evaluates performance along two main axes: the generality of the examples in the prompt and the complexity of the problem. While CoT can improve performance on simple problems, its effectiveness diminishes as problem size grows; in the planning tasks, for example, difficulty scales with the number of blocks. The study also extends the analysis to three scalable synthetic benchmarks (CoinFlip, LastLetterConcatenation, and multi-step arithmetic) to test how well CoT generalizes. In all cases, the performance improvements from CoT are limited and degrade as the problems become more complex.
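As a concrete illustration of what "scalable" means here, below is a minimal sketch of a generator that produces CoinFlip and LastLetterConcatenation instances whose difficulty is controlled by a single size parameter. The word list and function names are illustrative assumptions, not code from the paper.

```python
import random

# Sketch: synthetic benchmark instances parameterized by a single size knob,
# so arbitrarily many new problems of increasing difficulty can be generated.

WORDS = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]

def last_letter_concatenation(n_words: int, rng: random.Random):
    """Generate a LastLetterConcatenation instance with n_words words."""
    words = [rng.choice(WORDS) for _ in range(n_words)]
    question = f"Take the last letters of the words in: {' '.join(words)}. Concatenate them."
    answer = "".join(w[-1] for w in words)
    return question, answer

def coin_flip(n_people: int, rng: random.Random):
    """Generate a CoinFlip instance: the coin starts heads up; each person may flip it."""
    flips = [rng.choice([True, False]) for _ in range(n_people)]
    steps = [f"Person {i + 1} {'flips' if f else 'does not flip'} the coin."
             for i, f in enumerate(flips)]
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    answer = "yes" if sum(flips) % 2 == 0 else "no"  # even number of flips keeps heads up
    return question, answer

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 4, 8, 16):  # increasing problem size
        print(n, last_letter_concatenation(n, rng))
```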
The paper highlights the trade-off between the potential performance gains from CoT and the significant human effort required to craft specific, problem-tailored prompts. On this view, the improvements observed with CoT are better explained by the effectiveness of those carefully engineered prompts than by any general in-context learning ability of the LLM.
The study also demonstrates that even advanced CoT techniques, such as self-consistency, do not consistently improve performance across problem types. This indicates that the effectiveness of CoT is problem-specific and that the reasoning capabilities attributed to LLMs in previous studies may be overestimated. The findings call for more rigorous evaluation of LLMs on benchmarks that can generate arbitrarily many new instances of increasing difficulty, rather than relying on static test sets. The paper concludes that CoT does not teach LLMs generalizable reasoning procedures; its apparent gains come from pattern matching against the specific examples provided in the prompt.
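For reference, self-consistency amounts to sampling several chains of thought for the same prompt and taking a majority vote over the extracted final answers. A minimal sketch is below, assuming a hypothetical `sample_fn` that returns one sampled completion per call (e.g. an LLM queried at non-zero temperature) and answers formatted as "Answer: ..."; neither is an API from the paper.

```python
from collections import Counter
from typing import Callable

def extract_answer(completion: str) -> str:
    """Parse the final answer, assuming the completion ends with 'Answer: ...'."""
    return completion.rsplit("Answer:", 1)[-1].strip().lower()

def self_consistency(prompt: str, sample_fn: Callable[[str], str], n_samples: int = 10) -> str:
    """Sample several chains for the same prompt and return the majority-vote answer."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```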