23 May 2024 | Yuntian Deng, Yejin Choi, Stuart Shieber
This paper introduces a novel method called Stepwise Internalization to achieve implicit chain-of-thought (CoT) reasoning in language models. The approach starts from a model trained with explicit CoT and gradually removes intermediate CoT steps, allowing the model to internalize the reasoning over multiple training stages. The result is a shorter generation process that still maintains high accuracy. The method is demonstrated on tasks such as multi-digit multiplication and grade-school math problems, achieving high accuracy with significantly reduced inference cost compared to explicit CoT. The authors compare their approach to existing baselines, including No CoT, explicit CoT, and implicit CoT via knowledge distillation (ICoT-KD), showing that Stepwise Internalization outperforms these methods in both accuracy and speed. The paper also discusses limitations of the approach, such as high training costs and instability during training, and suggests future directions for improving its scalability and interpretability.
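To make the staged removal schedule concrete, below is a minimal Python sketch of how per-stage training targets might be constructed. The function names (`build_target`, `stepwise_internalization`), the choice of treating each CoT step as one removable unit, and the linear removal schedule are illustrative assumptions; the paper's actual procedure finetunes a full language model and removes CoT tokens gradually across training stages.

```python
import math

def build_target(question: str, cot_steps: list[str], answer: str,
                 num_removed: int) -> str:
    """Build the training target with the first `num_removed` CoT steps dropped."""
    kept = cot_steps[num_removed:]
    return " ".join([question, *kept, answer]).strip()


def stepwise_internalization(question: str, cot_steps: list[str], answer: str,
                             steps_per_stage: int = 1):
    """Yield (stage, target) pairs, removing more leading CoT at each stage."""
    num_stages = math.ceil(len(cot_steps) / steps_per_stage)
    for stage in range(num_stages + 1):
        num_removed = min(stage * steps_per_stage, len(cot_steps))
        yield stage, build_target(question, cot_steps, answer, num_removed)


if __name__ == "__main__":
    # Toy 2-digit multiplication example; the CoT lists partial products.
    question = "12*34="
    cot_steps = ["12*4=48", "12*30=360", "48+360=408"]
    answer = "408"
    for stage, target in stepwise_internalization(question, cot_steps, answer):
        # In practice, the model from the previous stage would be finetuned
        # on targets like this before advancing to the next stage; the final
        # stage contains no CoT at all, yielding an implicit-CoT model.
        print(f"stage {stage}: {target}")
```

Running the sketch prints one target per stage, from the full explicit CoT at stage 0 down to a question-answer pair with no intermediate steps at the last stage, which mirrors the high-level idea of internalizing the reasoning one chunk at a time.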