Distilling System 2 into System 1

24 Jul 2024 | Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov
This paper introduces a method for distilling System 2 reasoning techniques into System 1 large language models (LLMs), so that the fine-tuned model answers directly without generating intermediate reasoning steps. System 2 techniques such as Chain-of-Thought (CoT), Rephrase and Respond (RaR), System 2 Attention (S2A), and Branch-Solve-Merge (BSM) produce intermediate tokens during reasoning; distilling their outputs back into a System 1 model can retain much of the quality gain while reducing inference cost.

The approach uses unlabeled data: a System 2 method is run on each input, its final outputs are filtered for quality (typically via self-consistency over multiple samples), and the surviving (input, answer) pairs are used to fine-tune the System 1 model to respond directly.

Several System 2 techniques distill successfully, yielding better performance than the original System 1 model. For example, distilling the 2-step RaR method into a Llama-2-70B-chat model achieves 98.0% accuracy on the last letter concatenation task, significantly outperforming the original System 1 model. Similarly, distilling S2A improves performance on tasks with biased inputs, and distilling BSM improves evaluation accuracy while reducing computational cost. Not all tasks can be effectively distilled, however, particularly complex reasoning tasks that require extensive intermediate steps, such as certain math problems.
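To make the training recipe concrete, the sketch below shows one way the distillation data could be constructed, using 2-step RaR as the System 2 method and majority-vote self-consistency as the quality filter. It is a minimal illustration under assumptions: `generate` is a stand-in for sampling from the base chat model, and the prompts, the number of samples `k`, and the agreement threshold are hypothetical choices, not the paper's exact settings.

```python
from collections import Counter

def rephrase_and_respond(generate, question: str) -> str:
    """2-step RaR as the System 2 method: rephrase the question, then answer it."""
    rephrased = generate(f"Rephrase and expand the following question:\n{question}")
    return generate(
        f"Original question: {question}\n"
        f"Rephrased question: {rephrased}\n"
        "Answer the original question concisely."
    )

def build_distillation_set(generate, unlabeled_questions, k=8, threshold=0.75):
    """Run the System 2 method k times per input and keep only inputs whose
    sampled answers agree often enough (self-consistency filtering).
    The kept (prompt, target) pairs contain no intermediate reasoning tokens."""
    distilled = []
    for question in unlabeled_questions:
        answers = [rephrase_and_respond(generate, question) for _ in range(k)]
        # Answers are compared as exact strings here; a real pipeline would
        # normalize them or extract the final answer span first.
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes / k >= threshold:
            distilled.append({"prompt": question, "target": top_answer})
    return distilled
```

The resulting pairs are then used for standard supervised fine-tuning of the same base model, which afterwards maps the original prompt straight to the final answer in System 1 fashion.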
The study also highlights the importance of data quality in distillation, with self-consistency filtering playing a critical role in ensuring the reliability of the distilled data. The results demonstrate that distilling System 2 reasoning into System 1 can enable LLMs to perform complex tasks more efficiently, with performance comparable to System 2 methods but at a lower cost. This approach has potential applications in future AI systems, allowing them to focus System 2 capabilities on tasks they cannot yet perform well. The study also identifies limitations, such as the difficulty of distilling certain methods like CoT, and the need for further research to determine the optimal conditions for distillation. Overall, the work provides a promising direction for improving LLM performance through efficient reasoning techniques.