Revisiting Knowledge Distillation for Autoregressive Language Models

17 Jun 2024 | Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu*, Bo Du*, Dacheng Tao
The paper "Revisiting Knowledge Distillation for Autoregressive Language Models" addresses the issue of performance degradation when using larger teacher models in knowledge distillation (KD) for autoregressive language models (LMs). The authors find that larger teachers can lead to poorer student models, especially when the capability gap is significant. They propose a novel adaptive teaching approach (ATKD) to improve KD by reducing role learning and making teaching more diverse and flexible. ATKD distinguishes between different tokens, focusing on target-oriented knowledge distillation (TKD) for easy-to-learn tokens and diversity-oriented knowledge distillation (DKD) for hard-to-learn tokens. Extensive experiments on various LM benchmarks show that ATKD consistently and significantly improves performance across different model sizes and types, achieving up to +3.04% average gains. Additionally, ATKD enhances the generalization of distilled students. The paper also includes an evaluation of ATKD's effectiveness in larger models and discusses its limitations and potential future work.The paper "Revisiting Knowledge Distillation for Autoregressive Language Models" addresses the issue of performance degradation when using larger teacher models in knowledge distillation (KD) for autoregressive language models (LMs). The authors find that larger teachers can lead to poorer student models, especially when the capability gap is significant. They propose a novel adaptive teaching approach (ATKD) to improve KD by reducing role learning and making teaching more diverse and flexible. ATKD distinguishes between different tokens, focusing on target-oriented knowledge distillation (TKD) for easy-to-learn tokens and diversity-oriented knowledge distillation (DKD) for hard-to-learn tokens. Extensive experiments on various LM benchmarks show that ATKD consistently and significantly improves performance across different model sizes and types, achieving up to +3.04% average gains. Additionally, ATKD enhances the generalization of distilled students. The paper also includes an evaluation of ATKD's effectiveness in larger models and discusses its limitations and potential future work.