Compact Language Models via Pruning and Knowledge Distillation

19 Jul 2024 | Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Pavlo Molchanov
This paper presents a comprehensive approach to compressing large language models (LLMs) through pruning and knowledge distillation. The authors investigate whether pruning an existing LLM and retraining it with a small fraction of the original training data can serve as an effective alternative to full retraining. They develop a set of best practices for LLM compression that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining. These practices are derived from an extensive empirical exploration of pruning strategies along each axis, methods for combining axes, distillation strategies, and search techniques for finding optimal compressed architectures.
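The width-pruning step can be pictured as ranking structural components (e.g., MLP hidden neurons or attention heads) by an activation-based importance score computed on a small calibration set, then keeping only the top-ranked ones. The sketch below is a minimal, hypothetical illustration of this idea for a single MLP block in PyTorch; the function name, the ReLU activation, and the mean-absolute-activation score are assumptions for illustration, not the paper's exact importance metric.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear,
                     calib_inputs: torch.Tensor, keep: int):
    """Width-prune the hidden dimension of a two-layer MLP block.

    Hypothetical sketch: scores each hidden neuron by its mean absolute
    activation on calibration data and keeps the `keep` highest-scoring
    neurons by slicing the weight matrices.
    """
    # Hidden activations on the calibration batch (assumes a ReLU MLP).
    hidden = torch.relu(fc1(calib_inputs))
    # Per-neuron importance: mean |activation| over all calibration tokens.
    importance = hidden.abs().reshape(-1, hidden.shape[-1]).mean(dim=0)
    keep_idx = importance.topk(keep).indices.sort().values

    # Build smaller layers and copy over the retained rows/columns.
    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc1.weight.copy_(fc1.weight[keep_idx])
    new_fc2.weight.copy_(fc2.weight[:, keep_idx])
    if fc1.bias is not None:
        new_fc1.bias.copy_(fc1.bias[keep_idx])
    if fc2.bias is not None:
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```

After pruning along one or more such axes, the smaller model is retrained with distillation rather than trained from scratch.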
The authors apply these techniques to compress the Nemotron-4 family of LLMs by factors of 2-4×, achieving significant savings in training cost. The resulting MINITRON models outperform state-of-the-art compression techniques and perform comparably to other community models. MINITRON 8B achieves better accuracy than Nemotron-3 8B while using 40× fewer training tokens and performs similarly to Mistral 7B, Gemma 7B, and Llama-3 8B; MINITRON 4B outperforms the similarly sized Gemma2 model and compares favorably to Phi-2.

The paper distills these results into a practical list of LLM compression and retraining best practices, including structured pruning, knowledge distillation, and iterative pruning and distillation. The authors also compare retraining strategies, contrasting conventional training on the original data with knowledge distillation from the uncompressed model, and find that knowledge distillation yields better accuracy (see the sketch below). Overall, their approach significantly reduces training cost and improves model accuracy compared to training each compressed model from scratch. The MINITRON models are open-sourced on Hugging Face, with corresponding supplementary material available on GitHub.
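As a concrete illustration of distillation-based retraining, the following is a minimal sketch of a logit-distillation loss in PyTorch: the pruned student is trained to match the output distribution of the original (teacher) model via a KL-divergence term. The function name and the temperature parameter are assumptions for illustration; the paper's full objective may include additional terms.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    Hypothetical sketch of knowledge-distillation retraining: the pruned
    student mimics the frozen teacher's softened output distribution.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional for distillation losses.
    kld = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kld * temperature ** 2
```

In a retraining loop, the teacher logits would be computed under torch.no_grad() with the frozen original model, and this loss (optionally combined with the standard language-modeling loss) would drive updates to the pruned student.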