2024 | Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun
**DISTILLM: Towards Streamlined Distillation for Large Language Models**
**Abstract:**
Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference costs and memory usage while preserving performance. However, current KD methods for auto-regressive sequence models, such as large language models (LLMs), lack a standardized objective function and suffer from computational inefficiencies due to their reliance on student-generated outputs. To address these issues, we introduce DISTILLM, a more effective and efficient KD framework for auto-regressive language models. DISTILLM consists of two main components: (1) a novel skew Kullback-Leibler divergence (KLD) loss, which we theoretically analyze and show to be more stable and generalizable, and (2) an adaptive off-policy approach that makes the use of student-generated outputs more efficient. Extensive experiments, spanning instruction-following tasks, demonstrate that DISTILLM builds high-performing student models while achieving up to a 4.3× speedup compared to recent KD methods.
**Introduction:**
Recent advancements in auto-regressive language models have significantly improved text generation quality, but the large scale of these models also drives up computational costs and memory usage. Knowledge distillation (KD) is a promising way to compress these models while preserving their performance. However, existing KD methods for auto-regressive LMs face challenges such as mode averaging and a mismatch between the distributions seen during training and inference. Recent studies have explored alternative divergence losses and the use of student-generated outputs (SGOs) to address these issues, but they still lack a standardized objective function and incur substantial computational overhead from generating SGOs during training.
**Contributions:**
- **Skew KLD:** We introduce a new objective function, skew KLD, which overcomes the optimization-stability and generalization limitations of existing divergence objectives (a minimal sketch of its definition follows this list).
- **Adaptive Off-Policy Approach:** We propose an adaptive off-policy approach to balance the trade-off between noisy feedback and training-inference mismatch, improving sample efficiency and computational efficiency.
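As a point of reference, skew divergences evaluate the KL term against a mixture of the two distributions rather than against the raw student distribution. A minimal sketch of the objective, assuming $p$ denotes the teacher distribution, $q_\theta$ the student distribution, and $\alpha \in [0, 1]$ the skew coefficient (notation and parameterization are illustrative, not quoted from the paper):

```latex
% Skew KLD: KL divergence between the teacher distribution and an
% alpha-mixture of teacher and student (illustrative notation).
\[
  D_{\mathrm{SKL}}^{(\alpha)}(p \,\|\, q_\theta)
  \;=\; D_{\mathrm{KL}}\!\bigl(p \,\big\|\, \alpha p + (1 - \alpha)\, q_\theta\bigr)
\]
```

Because the mixture assigns nonzero mass wherever the teacher does (for $\alpha > 0$), the loss and its gradients remain bounded even when the student places near-zero probability on tokens the teacher favors, which is one intuitive reading of the stability claim above.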
**Background:**
- **KD for Auto-regressive Generative LMs:** The conventional (forward) KLD loss is widely used, but it has limitations such as mode averaging and a mismatch between the teacher-forced training distribution and the distribution the student actually produces at inference time (a minimal per-token sketch of this loss follows this list).
- **Pitfalls of Existing Distillation:** Previous methods often suffer from suboptimal performance and task-dependent variability due to the lack of a standardized distillation objective.
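To make the conventional objective concrete, the following is a minimal, hypothetical per-token forward KLD in PyTorch-style Python; the tensor names and shapes are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def forward_kld_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level forward KLD, KL(p_teacher || q_student).

    Both tensors are assumed to have shape (batch, seq_len, vocab_size).
    Illustrative sketch of the conventional KD objective only.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    log_q = F.log_softmax(student_logits, dim=-1)   # student log-probs
    p = log_p.exp()                                  # teacher probs
    # KL(p || q) = sum_v p * (log p - log q), averaged over all tokens.
    kld_per_token = (p * (log_p - log_q)).sum(dim=-1)
    return kld_per_token.mean()
```

Because the expectation is taken under the teacher, the student is pressured to spread mass over every mode the teacher covers, which is one way to see the mode-averaging behavior noted above.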
**Algorithm 1: Training Pipeline of DISTILLM:**
The training pipeline of DISTILLM includes a skew KLD loss and an adaptive off-policy approach to balance the use of SGOs and improve sample efficiency.
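The pipeline described above can be read as the hypothetical training-loop sketch below. The particular adaptive rule shown (increase the probability of replaying student-generated outputs from an off-policy buffer when validation loss stops improving) and the buffer mechanics are assumptions made for illustration; the paper's exact scheduling may differ.

```python
import random

def adaptive_offpolicy_loop(prompts, references, loss_fn, step_fn,
                            generate_fn, evaluate_fn,
                            epochs=3, sgo_prob=0.0, sgo_prob_step=0.1):
    """Hypothetical sketch of DISTILLM-style adaptive off-policy training.

    All callables are placeholders supplied by the caller:
      loss_fn(prompt, response) -> backprops one distillation step (e.g. skew KLD)
      step_fn()                 -> optimizer update
      generate_fn(prompt)       -> a fresh student-generated output (SGO)
      evaluate_fn()             -> current validation loss
    """
    replay_buffer = []              # off-policy store of previously generated SGOs
    best_val_loss = float("inf")

    for _ in range(epochs):
        for prompt, reference in zip(prompts, references):
            # Off-policy choice: replay an SGO with probability sgo_prob,
            # otherwise train on the fixed ground-truth response.
            if replay_buffer and random.random() < sgo_prob:
                response = random.choice(replay_buffer)   # reuse, no fresh generation
            else:
                response = reference
            loss_fn(prompt, response)
            step_fn()

            # Occasionally refresh the buffer with a new student generation.
            if random.random() < 0.05:
                replay_buffer.append(generate_fn(prompt))

        val_loss = evaluate_fn()
        if val_loss >= best_val_loss:
            # Validation stalled: lean more on student-generated outputs.
            sgo_prob = min(1.0, sgo_prob + sgo_prob_step)
        best_val_loss = min(best_val_loss, val_loss)
```

The design intent being illustrated is the trade-off named in the contributions: fixed-data updates are cheap but mismatch the student's own generations, while SGO updates close that gap at the cost of extra generation and noisier feedback, so the mix is adapted rather than fixed.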
**Results:**
- **Task-Agnostic Instruction-Following:** DISTILLM outperforms state-of-the-art methods in various instruction-following tasks, demonstrating superior performance and efficiency.
- **Text Summarization and Machine Translation:** DISTILLM also shows superior performance in text summarization and machine translation tasks, with improved computational efficiency.
**Conclusion:**
DISTILLM addresses the shortcomings of existing KD methods for auto-regressive language models by combining a skew KLD objective, analyzed for its stability and generalizability, with an adaptive off-policy approach to using student-generated outputs. Together, these components produce high-performing student models while achieving up to a 4.3× training speedup over recent KD baselines.