Compact Language Models via Pruning and Knowledge Distillation

19 Jul 2024 | Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Pavlo Molchanov
This paper presents a comprehensive approach to compressing large language models (LLMs) through pruning and knowledge distillation. The authors investigate whether pruning an existing LLM and retraining it with a small fraction of the original training data can serve as an effective alternative to full retraining. They develop a set of best practices for LLM compression that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining. These practices are derived from an extensive empirical exploration of pruning strategies along each axis, methods for combining axes, distillation strategies, and search techniques for finding optimal compressed architectures.
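The width-pruning step can be pictured as ranking structural components (e.g., MLP hidden neurons or attention heads) by an activation-based importance score computed on a small calibration set, then keeping only the top-ranked ones. The sketch below is a minimal, hypothetical illustration of this idea for a single MLP block in PyTorch; the function name, the ReLU activation, and the mean-absolute-activation score are assumptions for illustration, not the paper's exact importance metric.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear,
                     calib_inputs: torch.Tensor, keep: int):
    """Width-prune the hidden dimension of a two-layer MLP block.

    Hypothetical sketch: scores each hidden neuron by its mean absolute
    activation on calibration data and keeps the `keep` highest-scoring
    neurons by slicing the weight matrices.
    """
    # Hidden activations on the calibration batch (assumes a ReLU MLP).
    hidden = torch.relu(fc1(calib_inputs))
    # Per-neuron importance: mean |activation| over all calibration tokens.
    importance = hidden.abs().reshape(-1, hidden.shape[-1]).mean(dim=0)
    keep_idx = importance.topk(keep).indices.sort().values

    # Build smaller layers and copy over the retained rows/columns.
    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc1.weight.copy_(fc1.weight[keep_idx])
    new_fc2.weight.copy_(fc2.weight[:, keep_idx])
    if fc1.bias is not None:
        new_fc1.bias.copy_(fc1.bias[keep_idx])
    if fc2.bias is not None:
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```

After pruning along one or more such axes, the smaller model is retrained with distillation rather than trained from scratch.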
The authors apply these techniques to compress the Nemotron-4 family of LLMs by factors of 2-4×, achieving significant savings in training cost. The resulting MINITRON models outperform state-of-the-art compression techniques and perform comparably to other community models. MINITRON 8B achieves better accuracy than Nemotron-3 8B while using 40× fewer training tokens and performs similarly to Mistral 7B, Gemma 7B, and Llama-3 8B; MINITRON 4B outperforms the similarly sized Gemma2 model and compares favorably to Phi-2.

The paper distills these results into a practical list of LLM compression and retraining best practices, including structured pruning, knowledge distillation, and iterative pruning and distillation. The authors also compare retraining strategies, contrasting conventional training on the original data with knowledge distillation from the uncompressed model, and find that knowledge distillation yields better accuracy (see the sketch below). Overall, their approach significantly reduces training cost and improves model accuracy compared to training each compressed model from scratch. The MINITRON models are open-sourced on Hugging Face, with corresponding supplementary material available on GitHub.
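As a concrete illustration of distillation-based retraining, the following is a minimal sketch of a logit-distillation loss in PyTorch: the pruned student is trained to match the output distribution of the original (teacher) model via a KL-divergence term. The function name and the temperature parameter are assumptions for illustration; the paper's full objective may include additional terms.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    Hypothetical sketch of knowledge-distillation retraining: the pruned
    student mimics the frozen teacher's softened output distribution.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional for distillation losses.
    kld = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kld * temperature ** 2
```

In a retraining loop, the teacher logits would be computed under torch.no_grad() with the frozen original model, and this loss (optionally combined with the standard language-modeling loss) would drive updates to the pruned student.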