DISTILLM: Towards Streamlined Distillation for Large Language Models

2024 | Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun
DISTILLM is a novel knowledge distillation (KD) framework designed to efficiently compress large language models (LLMs) while maintaining their performance. The framework addresses two major challenges in KD for autoregressive models: (1) the lack of a standardized objective function and (2) the high computational cost of using student-generated outputs (SGOs) to reduce training-inference mismatches. DISTILLM introduces a skew Kullback-Leibler divergence (SKL) loss, which improves optimization stability and generalization, and an adaptive off-policy approach that efficiently leverages SGOs to enhance training efficiency. The SKL loss mitigates issues such as mode averaging and mode collapse by adjusting the mixing ratio of teacher and student distributions. The adaptive off-policy approach dynamically balances the use of SGOs and fixed datasets, reducing the risk of noisy feedback and improving sample efficiency.

Extensive experiments demonstrate that DISTILLM achieves up to 4.3× speedup compared to recent KD methods while maintaining high performance on instruction-following, text summarization, and machine translation tasks. The framework's effectiveness is supported by both theoretical analysis and empirical results, showing that DISTILLM outperforms existing KD methods in terms of training efficiency and performance.
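To make the skew divergence idea concrete, the sketch below shows one way such a loss could be computed in PyTorch: the reference distribution in the KL term is a mixture of the teacher and student distributions, controlled by a skew coefficient α. The function names, the epsilon smoothing, and the default α are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def skew_kl_loss(teacher_logits, student_logits, alpha=0.1, eps=1e-10):
    """Skew forward KL: KL(p_teacher || alpha * p_teacher + (1 - alpha) * p_student).

    teacher_logits, student_logits: tensors of shape (batch, seq_len, vocab).
    alpha: mixing ratio between the teacher and student distributions.
    Note: names and interface are a sketch, not the paper's reference code.
    """
    p_t = F.softmax(teacher_logits, dim=-1).detach()  # teacher is frozen
    p_s = F.softmax(student_logits, dim=-1)
    mix = alpha * p_t + (1.0 - alpha) * p_s
    kl = (p_t * (torch.log(p_t + eps) - torch.log(mix + eps))).sum(dim=-1)
    return kl.mean()

def skew_rkl_loss(teacher_logits, student_logits, alpha=0.1, eps=1e-10):
    """Skew reverse KL: KL(p_student || alpha * p_student + (1 - alpha) * p_teacher)."""
    p_t = F.softmax(teacher_logits, dim=-1).detach()
    p_s = F.softmax(student_logits, dim=-1)
    mix = alpha * p_s + (1.0 - alpha) * p_t
    kl = (p_s * (torch.log(p_s + eps) - torch.log(mix + eps))).sum(dim=-1)
    return kl.mean()
```

Intuitively, mixing the two distributions keeps the log-ratio bounded even where the student assigns near-zero probability, which is the source of the stability benefit over plain forward or reverse KL.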
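The adaptive off-policy component can be pictured as a per-batch decision between training on SGOs from a replay buffer or on the fixed dataset, where the SGO probability is only raised when validation loss stops improving. The class below is a hypothetical sketch of that idea; the concrete update rule, buffer handling, and hyperparameters used in DISTILLM may differ.

```python
import random

class AdaptiveSGOScheduler:
    """Hypothetical sketch of an adaptive off-policy schedule.

    With probability `p_sgo` a training batch is drawn from a replay buffer of
    student-generated outputs (SGOs); otherwise it comes from the fixed dataset.
    `p_sgo` grows only when validation loss fails to improve, so noisy early
    student samples are used sparingly and sample efficiency improves.
    """

    def __init__(self, p_init=0.0, p_step=0.1, p_max=1.0):
        self.p_sgo = p_init
        self.p_step = p_step
        self.p_max = p_max
        self.best_val_loss = float("inf")
        self.sgo_buffer = []  # replay buffer of student-generated outputs

    def update(self, val_loss):
        # Illustrative rule: raise the SGO probability when validation loss stalls.
        if val_loss >= self.best_val_loss:
            self.p_sgo = min(self.p_sgo + self.p_step, self.p_max)
        else:
            self.best_val_loss = val_loss

    def sample_source(self):
        # Decide per batch whether to train on SGOs or the fixed dataset.
        return "sgo" if random.random() < self.p_sgo else "fixed"
```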