Revisiting Knowledge Distillation for Autoregressive Language Models


17 Jun 2024 | Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao
This paper presents a novel adaptive teaching approach (ATKD) for knowledge distillation (KD) in autoregressive language models (LMs). The authors identify that traditional KD methods suffer from performance degradation when using larger teacher models, as different tokens require different teaching strategies. They propose ATKD, which adapts teaching modes based on token difficulty, reducing rote learning and promoting diverse knowledge acquisition. The core idea is to focus more on hard-to-learn tokens and less on easy-to-learn ones, leading to improved student model generalization.

The authors analyze the KD objective, reformulating it into target-oriented (TKD) and diversity-oriented (DKD) components. They find that DKD is more important but often suppressed by the teacher's uncertainty. ATKD decouples TKD and DKD, allowing for more flexible teaching (a sketch of this decomposition and of the adaptive loss is given after this summary).

Experiments on 8 LM tasks show that ATKD significantly improves performance across all model sizes, achieving up to +3.04% average gains. It also enhances the generalization of distilled students. The paper evaluates ATKD on various LM benchmarks, including 5 language generation tasks and 3 language understanding tasks, across three types of autoregressive LMs: OPT, Pythia, and LLaMA. Results show that ATKD outperforms standard KD methods, particularly in larger models. Ablation studies confirm that the choice of token ratio (k) and coefficient (λ) significantly affects performance, with optimal values of k = 50% and λ = 0.2.

The authors also discuss related works, noting that existing KD methods often fail to address performance degradation in larger autoregressive LMs. They argue that their approach is more effective in improving teaching quality and model generalization. The paper concludes that ATKD is a promising solution for improving KD in autoregressive LM compression.
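For concreteness, the TKD/DKD reformulation mentioned above can be sketched as follows. This is a decoupled-KD-style decomposition of the per-token KL objective; the notation (teacher distribution p^T, student distribution p^S, ground-truth token t) is chosen here for illustration and may differ from the paper's exact formulation.

\begin{align*}
\mathrm{KL}\left(p^{T} \,\|\, p^{S}\right)
  &= \underbrace{\mathrm{KL}\left(b^{T} \,\|\, b^{S}\right)}_{\text{target-oriented (TKD)}}
   \;+\; \left(1 - p^{T}_{t}\right)\,
     \underbrace{\mathrm{KL}\left(\hat{p}^{\,T} \,\|\, \hat{p}^{\,S}\right)}_{\text{diversity-oriented (DKD)}}, \\
b &= \left[\, p_{t},\; 1 - p_{t} \,\right] \quad \text{(target vs. non-target mass)}, \qquad
\hat{p}_{i} = \frac{p_{i}}{1 - p_{t}} \quad (i \neq t).
\end{align*}

In this coupled form, the diversity-oriented term is scaled by 1 - p^T_t, the teacher's uncertainty on the target token, so it shrinks whenever the teacher is confident; decoupling TKD and DKD removes this implicit weighting and lets the two teaching modes be balanced explicitly.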
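The adaptive teaching itself can be illustrated with a short sketch. The following minimal PyTorch function is an assumption-laden illustration, not the authors' implementation: the name adaptive_token_kd, the use of 1 - p^T_target as the token-difficulty score, and the particular way hard and easy tokens combine TKD and DKD are all illustrative choices; only the hyperparameter defaults k = 50% and λ = 0.2 come from the ablation results above.

# Minimal PyTorch sketch of a token-adaptive, decoupled KD loss in the spirit of ATKD.
# The hard/easy combination below is an assumption, not the reference implementation.
import torch
import torch.nn.functional as F

def adaptive_token_kd(student_logits, teacher_logits, targets, k=0.5, lam=0.2, tau=1.0):
    """student_logits, teacher_logits: [num_tokens, vocab]; targets: [num_tokens] token ids."""
    eps = 1e-8
    probs_t = F.softmax(teacher_logits / tau, dim=-1)   # teacher next-token distribution
    probs_s = F.softmax(student_logits / tau, dim=-1)   # student next-token distribution

    # Probability mass each model places on the ground-truth (target) token.
    pt_teacher = probs_t.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt_student = probs_s.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Target-oriented KD (TKD): binary KL over {target, all non-target} mass.
    tkd = pt_teacher * ((pt_teacher + eps).log() - (pt_student + eps).log()) \
        + (1 - pt_teacher) * ((1 - pt_teacher + eps).log() - (1 - pt_student + eps).log())

    # Diversity-oriented KD (DKD): KL over the renormalized non-target classes.
    target_mask = F.one_hot(targets, num_classes=probs_t.size(-1)).bool()
    hat_t = probs_t.masked_fill(target_mask, 0.0) / (1 - pt_teacher + eps).unsqueeze(-1)
    hat_s = probs_s.masked_fill(target_mask, 0.0) / (1 - pt_student + eps).unsqueeze(-1)
    dkd = (hat_t * ((hat_t + eps).log() - (hat_s + eps).log())).sum(-1)

    # Adaptive teaching: rank tokens by teacher uncertainty on the target token and
    # treat the top-k fraction as hard-to-learn.
    uncertainty = 1 - pt_teacher
    num_hard = max(1, int(k * uncertainty.numel()))
    hard = torch.zeros_like(uncertainty, dtype=torch.bool)
    hard[uncertainty.topk(num_hard).indices] = True

    # Assumed combination (illustrative): hard tokens get the full decoupled signal
    # (TKD + DKD), easy tokens keep only the diversity-oriented part, and lam
    # balances the two groups so that hard tokens dominate when lam is small.
    hard_loss = (tkd + dkd)[hard].mean()
    easy_loss = dkd[~hard].mean() if (~hard).any() else dkd.sum() * 0.0
    return lam * easy_loss + (1 - lam) * hard_loss

In practice, such a distillation loss would typically be combined with the standard next-token cross-entropy on the ground-truth data while fine-tuning the student.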