10 Jan 2023 | Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou
This paper explores how chain-of-thought prompting enhances the reasoning abilities of large language models (LLMs). Chain-of-thought prompting involves providing a few exemplars that include intermediate reasoning steps, which helps LLMs generate a coherent series of reasoning steps when solving complex tasks. Experiments on three large language models show that this method significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks. For example, PaLM 540B with chain-of-thought prompting achieves state-of-the-art results on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
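The mechanism is straightforward to sketch in code. Below is a minimal Python illustration of how a few-shot chain-of-thought prompt might be assembled: each exemplar pairs a question with worked reasoning steps before the final answer, and the new question is appended at the end. The Q:/A: layout and the "The answer is ..." suffix are illustrative assumptions of this sketch, not the paper's exact prompt set.

```python
# Minimal sketch of few-shot chain-of-thought prompt construction.
# Exemplar adapted from the paper's illustrative math word problem;
# the exact prompt format here is an assumption, not the paper's.

COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "chain_of_thought": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]

def build_cot_prompt(new_question: str) -> str:
    """Prepend exemplars (question, reasoning steps, answer) to the new question."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['chain_of_thought']} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)

print(build_cot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls. "
    "How many golf balls are there?"
))
```

The model is then asked to continue this text, and because the exemplars demonstrate reasoning before answering, the completion tends to follow the same pattern.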
Standard few-shot prompting, which shows the model only input-output pairs, often leads to incorrect answers on multi-step problems, whereas chain-of-thought prompting elicits a coherent sequence of reasoning steps that leads to the correct answer. For instance, on a math word problem the model works through each step of the calculation before stating the result. This makes the method effective for tasks that require multi-step reasoning, such as arithmetic and commonsense problems.
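Evaluation then reduces to reading the final answer off the generated chain of thought. The sketch below assumes the prompt format above, where completions end with "The answer is ..."; this extraction rule is an assumption of the sketch, not something the paper specifies here.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final numeric answer out of a chain-of-thought completion.

    Assumes the exemplars end with "The answer is <number>.", so the model
    tends to imitate that pattern; falls back to the last number in the text.
    """
    match = re.search(r"The answer is\s*(-?[\d,\.]+)", completion)
    if match:
        return match.group(1).strip(".,")
    numbers = re.findall(r"-?[\d,\.]+", completion)
    return numbers[-1].strip(".,") if numbers else None

completion = (
    "Half of 16 balls is 16 / 2 = 8, so there are 8 golf balls. "
    "The answer is 8."
)
print(extract_final_answer(completion))  # -> 8
```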
The paper discusses the benefits of chain-of-thought prompting: it decomposes complex problems into intermediate steps, provides an interpretable window into the model's reasoning path, and applies to a range of tasks such as math word problems, commonsense reasoning, and symbolic manipulation. It also highlights that sufficiently large LLMs can generate these reasoning steps when prompted with just a few exemplars.
Experiments show that chain-of-thought prompting outperforms standard prompting, with the largest gains on the most complex tasks. For example, PaLM 540B with chain-of-thought prompting achieves new state-of-the-art performance on the GSM8K benchmark. The gains are robust to different annotators writing the chains of thought and to different choices of exemplars, and the method enables generalization to sequences longer than those seen in the prompt for symbolic reasoning tasks.
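To make the symbolic-reasoning and length-generalization claims concrete, here is a hedged sketch of a last-letter-concatenation prompt, one kind of symbolic task used in this line of work. The exemplar demonstrates the reasoning for two words while the query asks about four; the exact wording and names are illustrative assumptions, not the paper's prompt.

```python
# Illustrative chain-of-thought exemplar for a symbolic task
# (last-letter concatenation); wording and names are assumptions here.
SYMBOLIC_EXEMPLAR = (
    'Q: Take the last letters of the words in "Amy Brown" and concatenate them.\n'
    'A: The last letter of "Amy" is "y". The last letter of "Brown" is "n". '
    'Concatenating them gives "yn". The answer is yn.\n'
)

def build_symbolic_prompt(words: list[str]) -> str:
    """Build a prompt whose query may contain more words than the exemplar."""
    phrase = " ".join(words)
    return (
        SYMBOLIC_EXEMPLAR
        + f'\nQ: Take the last letters of the words in "{phrase}" and concatenate them.\nA:'
    )

# The exemplar reasons over two words, but the query uses four, probing
# generalization to longer sequences than those shown in the prompt.
print(build_symbolic_prompt(["Ada", "Lovelace", "Grace", "Hopper"]))
```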
The paper also discusses limitations, such as the need for sufficient model scale before chain-of-thought reasoning emerges and the lack of any guarantee that the generated reasoning paths are correct. It concludes that chain-of-thought prompting is a simple and effective method for enhancing reasoning in LLMs, with potential applications across many domains. The results suggest that standard prompting only provides a lower bound on the capabilities of large language models, and further research is needed to explore the full range of their reasoning abilities.