Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

16 Jun 2024 | Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong
This paper reconsiders the use of Kullback-Leibler (KL) divergence in Knowledge Distillation (KD) for Large Language Models (LLMs). Contrary to previous claims that reverse KL (RKL) divergence is mode-seeking and thus superior to forward KL (FKL) divergence, the authors demonstrate, both empirically and theoretically, that neither mode-seeking nor mean-seeking behavior manifests in KD for LLMs. Instead, FKL and RKL share the same optimization objective and converge to the same solution after a sufficient number of epochs; in practice, however, LLMs are rarely trained for that long. The authors further observe that, at the beginning of training, RKL focuses on the tail of the teacher distribution while FKL focuses on the head. To exploit this, they propose a novel Adaptive Kullback-Leibler (AKL) divergence, which adaptively allocates weights to combine FKL and RKL. Evaluations on various benchmarks show that AKL outperforms baselines across tasks, improving both the diversity and quality of generated responses.
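To make the FKL/RKL contrast and the adaptive combination concrete, below is a minimal PyTorch sketch. The head/tail split by cumulative teacher probability, the gap-based mixing weight `mu`, and the `head_mass` threshold are illustrative assumptions based on the summary above, not the paper's exact formulation.

```python
# Sketch of forward KL (FKL), reverse KL (RKL), and an adaptive mix in the
# spirit of AKL. The weighting scheme here is hypothetical, for illustration.
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    # FKL = KL(p_teacher || q_student): dominated by the head
    # (high-probability tokens) of the teacher distribution.
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return torch.sum(p * (log_p - log_q), dim=-1)

def reverse_kl(teacher_logits, student_logits):
    # RKL = KL(q_student || p_teacher): comparatively more sensitive to the
    # tail (low-probability tokens) early in training.
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return torch.sum(q * (log_q - log_p), dim=-1)

def adaptive_kl(teacher_logits, student_logits, head_mass=0.5):
    # Hypothetical adaptive weighting: split the vocabulary into a "head"
    # (top teacher-probability tokens covering `head_mass` of the mass) and
    # a "tail" (the rest), measure the teacher-student gap on each part,
    # and mix FKL and RKL in proportion to where the gap is larger.
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    p_sorted, idx = torch.sort(p, dim=-1, descending=True)
    head = torch.cumsum(p_sorted, dim=-1) <= head_mass  # bool mask over sorted vocab
    gap = torch.abs(p_sorted - torch.gather(q, -1, idx))
    g_head = torch.sum(gap * head, dim=-1)
    g_tail = torch.sum(gap * (~head), dim=-1)
    mu = g_head / (g_head + g_tail + 1e-8)  # larger head gap -> more FKL
    return mu * forward_kl(teacher_logits, student_logits) \
        + (1.0 - mu) * reverse_kl(teacher_logits, student_logits)

# Usage: per-position distillation loss over a (batch, vocab) logit pair.
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000)
loss = adaptive_kl(teacher_logits, student_logits).mean()
```

Note that when the head gap dominates, `mu` approaches 1 and the loss behaves like FKL (fitting the head); when the tail gap dominates, it behaves like RKL, which matches the intuition of allocating effort where the student deviates most.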