Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

2024 | Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong
This paper challenges the common belief that, in knowledge distillation (KD) for large language models (LLMs), reverse KL (RKL) divergence is mode-seeking while forward KL (FKL) divergence is mean-seeking. Through empirical and theoretical analysis, the study shows that neither FKL nor RKL exhibits the expected mode-seeking or mean-seeking behavior in KD for LLMs; instead, both converge to the same optimization objective after sufficiently many training epochs. In practice, however, LLMs are rarely trained that long. The analysis further reveals that at the beginning of training, FKL focuses on the head of the teacher distribution while RKL focuses on the tail.

Based on this observation, the authors propose an Adaptive Kullback-Leibler (AKL) divergence that combines FKL and RKL, with weights adjusted according to the head and tail parts of the distributions. AKL outperforms existing methods across various tasks and improves both the diversity and the quality of generated responses, as measured by Rouge-L and GPT-4 scores. By better aligning the student and teacher distributions, AKL highlights the practical limits on LLM training budgets and the value of adaptive objectives in KD: it is a simple yet effective method that adapts to the characteristics of the distributions during training.
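To make the head/tail-weighted combination concrete, here is a minimal PyTorch sketch of an AKL-style loss. The specific choices below are assumptions for illustration, not the paper's exact formulation: the head is taken as the smallest set of tokens covering `head_mass` of the teacher's probability, and the per-token weight `mu` is derived from the teacher-student gap on the head versus the tail.

```python
import torch
import torch.nn.functional as F

def akl_loss(teacher_logits, student_logits, head_mass=0.5, eps=1e-8):
    """Sketch of an adaptive FKL/RKL combination (AKL-style).

    Assumption: per token, the teacher distribution is split into a "head"
    (top tokens covering `head_mass` of probability) and a "tail"; the gap
    between teacher and student on each part sets the FKL vs. RKL weight.
    """
    p = F.softmax(teacher_logits, dim=-1)   # teacher distribution
    q = F.softmax(student_logits, dim=-1)   # student distribution
    log_p = torch.log(p + eps)
    log_q = torch.log(q + eps)

    # Head tokens: smallest set whose cumulative teacher mass reaches head_mass.
    sorted_p, idx = torch.sort(p, dim=-1, descending=True)
    cum = torch.cumsum(sorted_p, dim=-1)
    head_sorted = cum <= head_mass
    head_sorted[..., 0] = True              # always keep the top-1 token
    head_mask = torch.zeros_like(p).scatter(-1, idx, head_sorted.float()).bool()

    # Teacher-student gap on head vs. tail decides the per-token weight mu.
    diff = (p - q).abs()
    g_head = (diff * head_mask).sum(dim=-1)
    g_tail = (diff * (~head_mask)).sum(dim=-1)
    mu = g_head / (g_head + g_tail + eps)   # weight on forward KL

    fkl = (p * (log_p - log_q)).sum(dim=-1)  # FKL(p || q): emphasizes the head
    rkl = (q * (log_q - log_p)).sum(dim=-1)  # RKL(q || p): emphasizes the tail
    return (mu * fkl + (1.0 - mu) * rkl).mean()
```

The weighting follows the paper's intuition: when the student diverges from the teacher mostly on head tokens, the head-focused FKL term dominates; when the mismatch sits in the tail, the tail-focused RKL term takes over.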