9 Jun 2024 | Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, Ke Ding
This paper addresses the challenge of knowledge distillation, specifically Chain-of-Thought (CoT) distillation, which aims to transfer the superior reasoning capabilities of larger models to smaller ones. The Distilling Step-by-Step (DSS) method, while effective, struggles to integrate its two tasks, label prediction and rationale generation, leading to suboptimal performance. To overcome this, the authors model the DSS framework through the Information Bottleneck (IB) principle and formulate it as an optimization problem that maximizes the mutual information (MI) between the two tasks. They introduce a variational method to estimate MI, improving the alignment between label prediction and rationale generation. Experimental results on four datasets demonstrate that their method outperforms the state-of-the-art DSS baseline, strengthening the reasoning capabilities of the distilled models. The paper also analyzes the proposed method's effectiveness, including model calibration and CoT output quality, and discusses broader implications for language model distillation and CoT applications.
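To make the core idea concrete, here is a minimal sketch of the kind of objective the summary implies; the notation (z_label and z_rationale for the task representations, q_phi for the variational decoder) is assumed for illustration and is not taken from the paper. The shared MI term can be lower-bounded with a standard Barber-Agakov variational bound, which is what makes the MI maximization tractable in practice:

\max_{\theta}\; I\big(z_{\text{label}};\, z_{\text{rationale}}\big),
\qquad
I(A; B) \;\ge\; \mathbb{E}_{p(a,b)}\big[\log q_{\phi}(a \mid b)\big] + H(A).

Under this kind of bound, maximizing MI reduces to training an auxiliary predictor q_phi that reconstructs one task's representation from the other's, alongside the usual DSS losses; the exact formulation used by the authors may differ.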