24 Jul 2024 | Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov
This paper explores distilling the higher-quality outputs of System 2 techniques back into System 1 (direct, non-reasoning) generations in large language models (LLMs). System 2 techniques, such as Chain-of-Thought, Rephrase and Respond, System 2 Attention, and Branch-Solve-Merge, generate intermediate tokens to improve reasoning and accuracy, but at the cost of increased inference time. The authors propose a self-supervised method to distill these System 2 outputs into System 1 generations that require no intermediate reasoning tokens, reducing inference cost while maintaining or improving performance.
The approach collects high-quality outputs from System 2 methods on unlabeled data, curates them using self-consistency criteria, and fine-tunes a System 1 model to match the System 2 predictions. Experiments across four System 2 methods and five tasks show that the distilled System 1 models outperform the original System 1 models and achieve similar or better results than the System 2 methods, at reduced inference cost. However, the method is not always effective, particularly for complex tasks that genuinely require chain-of-thought reasoning. The paper concludes by discussing the potential of System 2 distillation in future AI systems, freeing them to spend explicit reasoning effort on tasks they cannot yet handle well.
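The self-consistency curation step described above can be sketched as a majority-vote filter: sample several System 2 outputs for each unlabeled input and keep the (input, answer) pair only when a clear majority of samples agree. The sketch below is illustrative, not the paper's code; `system2_generate`, the sample count, and the agreement threshold are all assumptions.

```python
from collections import Counter

def curate_distillation_pair(prompt, system2_generate,
                             num_samples=8, threshold=0.75):
    """Self-consistency curation for System 2 distillation (sketch).

    Samples multiple System 2 outputs for an unlabeled prompt and keeps
    the majority answer only if agreement exceeds `threshold`. Kept pairs
    become fine-tuning targets for the System 1 model, with no
    intermediate reasoning tokens in the target.
    """
    outputs = [system2_generate(prompt) for _ in range(num_samples)]
    answer, count = Counter(outputs).most_common(1)[0]
    if count / num_samples >= threshold:
        return (prompt, answer)
    return None  # inconsistent prompts are discarded, not distilled
```

A standard supervised fine-tuning loop over the retained (prompt, answer) pairs then produces the distilled System 1 model.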