2024 | Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun
**DISTILLM: Towards Streamlined Distillation for Large Language Models**
**Abstract:**
Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference costs and memory usage while preserving performance. However, current KD methods for auto-regressive sequence models, such as large language models (LLMs), lack a standardized objective function and suffer from computational inefficiencies due to their reliance on student-generated outputs. To address these issues, we introduce DISTILLM, a more effective and efficient KD framework for auto-regressive language models. DISTILLM consists of two main components: (1) a novel skew Kullback-Leibler divergence (KLD) loss, which we theoretically analyze and show to be more stable and generalizable, and (2) an adaptive off-policy approach that makes the use of student-generated outputs more efficient. Extensive experiments, spanning instruction-following tasks, demonstrate that DISTILLM builds high-performing student models while achieving up to a 4.3× speedup compared to recent KD methods.
**Introduction:**
Recent advancements in auto-regressive language models have significantly improved text generation quality, but the large scale of these models also drives up computational costs and memory usage. Knowledge distillation (KD) is a promising way to compress these models while preserving their performance. However, existing KD methods for auto-regressive LMs face challenges such as mode averaging and a mismatch between the distributions seen during training and inference. Recent studies have explored alternative divergence losses and the use of student-generated outputs (SGOs) to address these issues, but they still lack a standardized objective function and incur substantial computational overhead from generating SGOs during training.
**Contributions:**
- **Skew KLD:** We introduce a new objective function, skew KLD, which overcomes the optimization-stability and generalization limitations of existing divergence objectives (a minimal sketch of its definition follows this list).
- **Adaptive Off-Policy Approach:** We propose an adaptive off-policy approach to balance the trade-off between noisy feedback and training-inference mismatch, improving sample efficiency and computational efficiency.
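As a point of reference, skew divergences evaluate the KL term against a mixture of the two distributions rather than against the raw student distribution. A minimal sketch of the objective, assuming $p$ denotes the teacher distribution, $q_\theta$ the student distribution, and $\alpha \in [0, 1]$ the skew coefficient (notation and parameterization are illustrative, not quoted from the paper):

```latex
% Skew KLD: KL divergence between the teacher distribution and an
% alpha-mixture of teacher and student (illustrative notation).
\[
  D_{\mathrm{SKL}}^{(\alpha)}(p \,\|\, q_\theta)
  \;=\; D_{\mathrm{KL}}\!\bigl(p \,\big\|\, \alpha p + (1 - \alpha)\, q_\theta\bigr)
\]
```

Because the mixture assigns nonzero mass wherever the teacher does (for $\alpha > 0$), the loss and its gradients remain bounded even when the student places near-zero probability on tokens the teacher favors, which is one intuitive reading of the stability claim above.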
**Background:**
- **KD for Auto-regressive Generative LMs:** The conventional (forward) KLD loss is widely used, but it has limitations such as mode averaging and a mismatch between the teacher-forced training distribution and the distribution the student actually produces at inference time (a minimal per-token sketch of this loss follows this list).
- **Pitfalls of Existing Distillation:** Previous methods often suffer from suboptimal performance and task-dependent variability due to the lack of a standardized distillation objective.
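To make the conventional objective concrete, the following is a minimal, hypothetical per-token forward KLD in PyTorch-style Python; the tensor names and shapes are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def forward_kld_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level forward KLD, KL(p_teacher || q_student).

    Both tensors are assumed to have shape (batch, seq_len, vocab_size).
    Illustrative sketch of the conventional KD objective only.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    log_q = F.log_softmax(student_logits, dim=-1)   # student log-probs
    p = log_p.exp()                                  # teacher probs
    # KL(p || q) = sum_v p * (log p - log q), averaged over all tokens.
    kld_per_token = (p * (log_p - log_q)).sum(dim=-1)
    return kld_per_token.mean()
```

Because the expectation is taken under the teacher, the student is pressured to spread mass over every mode the teacher covers, which is one way to see the mode-averaging behavior noted above.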
**Algorithm 1: Training Pipeline of DISTILLM:**
The training pipeline of DISTILLM includes a skew KLD loss and an adaptive off-policy approach to balance the use of SGOs and improve sample efficiency.
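The pipeline described above can be read as the hypothetical training-loop sketch below. The particular adaptive rule shown (increase the probability of replaying student-generated outputs from an off-policy buffer when validation loss stops improving) and the buffer mechanics are assumptions made for illustration; the paper's exact scheduling may differ.

```python
import random

def adaptive_offpolicy_loop(prompts, references, loss_fn, step_fn,
                            generate_fn, evaluate_fn,
                            epochs=3, sgo_prob=0.0, sgo_prob_step=0.1):
    """Hypothetical sketch of DISTILLM-style adaptive off-policy training.

    All callables are placeholders supplied by the caller:
      loss_fn(prompt, response) -> backprops one distillation step (e.g. skew KLD)
      step_fn()                 -> optimizer update
      generate_fn(prompt)       -> a fresh student-generated output (SGO)
      evaluate_fn()             -> current validation loss
    """
    replay_buffer = []              # off-policy store of previously generated SGOs
    best_val_loss = float("inf")

    for _ in range(epochs):
        for prompt, reference in zip(prompts, references):
            # Off-policy choice: replay an SGO with probability sgo_prob,
            # otherwise train on the fixed ground-truth response.
            if replay_buffer and random.random() < sgo_prob:
                response = random.choice(replay_buffer)   # reuse, no fresh generation
            else:
                response = reference
            loss_fn(prompt, response)
            step_fn()

            # Occasionally refresh the buffer with a new student generation.
            if random.random() < 0.05:
                replay_buffer.append(generate_fn(prompt))

        val_loss = evaluate_fn()
        if val_loss >= best_val_loss:
            # Validation stalled: lean more on student-generated outputs.
            sgo_prob = min(1.0, sgo_prob + sgo_prob_step)
        best_val_loss = min(best_val_loss, val_loss)
```

The design intent being illustrated is the trade-off named in the contributions: fixed-data updates are cheap but mismatch the student's own generations, while SGO updates close that gap at the cost of extra generation and noisier feedback, so the mix is adapted rather than fixed.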
**Results:**
- **Task-Agnostic Instruction-Following:** DISTILLM outperforms state-of-the-art methods in various instruction-following tasks, demonstrating superior performance and efficiency.
- **Text Summarization and Machine Translation:** DISTILLM also shows superior performance in text summarization and machine translation tasks, with improved computational efficiency.
**Conclusion:**
DISTILLM addresses the shortcomings of existing KD methods for auto-regressive language models by combining a skew KLD objective, analyzed for its stability and generalizability, with an adaptive off-policy approach to using student-generated outputs. Together, these components produce high-performing student models while achieving up to a 4.3× training speedup over recent KD baselines.