9 Jun 2024 | Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, Ke Ding
This paper proposes a novel method for chain-of-thought (CoT) distillation that enhances the reasoning capabilities of smaller language models by maximizing the mutual information (MI) between the label prediction and rationale generation tasks. Grounded in the information bottleneck (IB) principle, the method captures the intrinsic relationship between the two tasks and introduces a variational approach to estimate the MI between their representation features, enabling more effective knowledge transfer. In comprehensive experiments on two smaller T5 models (T5-base and T5-small) across four popular datasets, the proposed method outperforms the state-of-the-art DSS baseline in both label prediction and rationale generation, and it strengthens the alignment between the two tasks, leading to better knowledge transfer from CoT. The study also offers qualitative and quantitative analysis of the relationship between the predictive and explainable tasks under multi-task learning (MTL), providing a theoretical foundation for future research on CoT distillation. The method is implemented with a learning-based approach, and the code is available for further exploration.
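To make the core idea concrete, below is a minimal sketch (not the paper's exact implementation) of how a variational MI term can be added to a multi-task CoT distillation objective: an InfoNCE-style contrastive critic lower-bounds the MI between pooled representations of the label-prediction and rationale-generation passes, and minimizing the critic loss maximizes that bound. The projection dimension, temperature, pooling choice, and loss weight are assumptions for illustration; the paper's actual estimator and weighting may differ.

```python
# Hedged sketch: InfoNCE-style variational lower bound on
# MI(z_label; z_rationale), used as an auxiliary loss alongside the usual
# label-prediction and rationale-generation losses. Hyperparameters here
# (proj_dim, temperature, mi_weight) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class InfoNCEMIEstimator(nn.Module):
    """Contrastive critic that lower-bounds MI between two task representations."""

    def __init__(self, hidden_dim: int, proj_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        self.proj_label = nn.Linear(hidden_dim, proj_dim)
        self.proj_rationale = nn.Linear(hidden_dim, proj_dim)
        self.temperature = temperature

    def forward(self, z_label: torch.Tensor, z_rationale: torch.Tensor) -> torch.Tensor:
        # z_label, z_rationale: (batch, hidden_dim) pooled encoder features
        # from the label-prediction and rationale-generation forward passes.
        a = F.normalize(self.proj_label(z_label), dim=-1)
        b = F.normalize(self.proj_rationale(z_rationale), dim=-1)
        logits = a @ b.t() / self.temperature          # (batch, batch) similarities
        targets = torch.arange(a.size(0), device=a.device)
        # Matched (label, rationale) pairs are positives; all other pairs in
        # the batch serve as negatives. The cross-entropy here is the negative
        # InfoNCE bound, so minimizing it maximizes the MI estimate.
        return F.cross_entropy(logits, targets)


def distillation_loss(label_loss, rationale_loss, z_label, z_rationale,
                      mi_estimator, mi_weight: float = 0.1):
    """Combine the two task losses with the MI-maximization term."""
    mi_loss = mi_estimator(z_label, z_rationale)
    return label_loss + rationale_loss + mi_weight * mi_loss
```

In practice, `z_label` and `z_rationale` would be pooled hidden states from the student T5's two task-specific forward passes, so the MI term explicitly encourages the two tasks' representations to stay aligned during multi-task training.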