22 Jun 2024 | Mingyu Jin, Qinkai Yu, Shu Dong, Haiyan Zhao, Wenyue Hua, Yanda Meng, Yongfeng Zhang, Mengnan Du
This paper investigates how reasoning step length affects the performance of large language models (LLMs) under Chain-of-Thought (CoT) prompting, and finds that the number of reasoning steps in a CoT prompt significantly influences performance. In few-shot CoT, accuracy correlates roughly linearly with step count across multiple datasets, so lengthening the reasoning chain strengthens LLM reasoning. Even rationales containing errors can still yield favorable outcomes, provided they preserve the required length of inference. The benefit is task-dependent: simple tasks need only a few steps, while complex tasks gain substantially from longer inference sequences. Zero-shot CoT benefits as well, as shown by modifying the initial prompt to encourage more extensive thinking.

Conversely, compressing the reasoning steps in few-shot demonstrations hurts performance, dropping it back toward zero-shot levels. The required number of steps is also related to model size, with larger models showing greater tolerance for longer reasoning chains, and the nature of the questions themselves matters less to performance than the length of the reasoning. These findings offer practical guidance for optimizing CoT strategies on complex NLP tasks, underscore the importance of reasoning step length, and motivate further research into the underlying mechanisms of CoT prompting.
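To make the setup concrete, here is a minimal sketch (not the authors' code) of how one might construct zero-shot and few-shot CoT prompts whose reasoning-step count can be varied, in order to probe the step-length effect described above. The exact prompt wording and the helper names are illustrative assumptions; plug the resulting strings into whatever LLM API you use.

```python
def zero_shot_cot_prompt(question: str, min_steps: int = 0) -> str:
    """Zero-shot CoT: the standard 'think step by step' trigger, optionally
    extended (assumed wording) to encourage a longer reasoning chain."""
    trigger = "Let's think step by step."
    if min_steps > 0:
        trigger += f" Please use at least {min_steps} distinct reasoning steps."
    return f"Q: {question}\nA: {trigger}"


def few_shot_cot_prompt(demos, question: str) -> str:
    """Few-shot CoT: each demo is (question, [reasoning steps], answer).
    Adding or removing steps in the demos changes the reasoning-step length
    the model is encouraged to imitate."""
    blocks = []
    for q, steps, answer in demos:
        rationale = " ".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
        blocks.append(f"Q: {q}\nA: {rationale} The answer is {answer}.")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Toy demonstration with a short, explicit reasoning chain.
    demos = [
        (
            "Tom has 3 apples and buys 2 more. How many apples does he have?",
            ["Tom starts with 3 apples.", "He buys 2 more, so 3 + 2 = 5."],
            "5",
        ),
    ]
    target = "A box holds 4 pens. How many pens are in 6 boxes?"
    print(few_shot_cot_prompt(demos, target))
    print()
    print(zero_shot_cot_prompt(target, min_steps=3))
```

Varying the number of entries in each demo's step list (or the `min_steps` hint in the zero-shot trigger) is the kind of manipulation the paper uses to study how reasoning-chain length relates to accuracy.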