Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

6 May 2024 | Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz
This paper introduces a novel approach to creating accurate, sparse foundational versions of large language models (LLMs) that achieve full accuracy recovery on fine-tuning tasks at up to 70% sparsity. The method combines sparse pretraining and pruning with efficient deployment techniques. For the Llama-2 7B model, sparsity is introduced with a modified version of the SparseGPT pruning algorithm, followed by sparse pretraining on subsets of the SlimPajama and The Stack datasets. This approach enables higher accuracy recovery during fine-tuning than existing pruning methods.

The study demonstrates significant improvements in both training and inference speed. Sparse training on Cerebras CS-3 systems achieves near-theoretical scaling with sparsity, while inference is accelerated by up to 3x on CPUs and 1.7x on GPUs using Neural Magic's DeepSparse and nm-vllm engines, respectively. Combining sparsity with quantization further improves performance, reaching up to 8.6x speedup on CPUs for sparse-quantized Llama models.

The results are validated across diverse tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization, demonstrating the generality of the approach. The sparse models, along with code and documentation, are open-sourced to support reproducibility and extension of the results.

The methodology covers sparse pretraining, sparse fine-tuning, and efficient inference. Sparse pretraining accelerates training on the Cerebras CS-3, while sparse inference on CPUs and GPUs is optimized through bitmask expansion of compressed weights and efficient memory management.
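The paper's inference engines (DeepSparse on CPU, nm-vllm on GPU) implement bitmask-based weight compression internally; the NumPy sketch below only illustrates the general idea, not their actual implementation. A pruned weight matrix is stored as a packed bitmask plus the list of nonzero values, and is expanded back to dense form before use. All function names here are hypothetical.

```python
import numpy as np

def compress_bitmask(w: np.ndarray):
    """Illustrative sketch: store a sparse weight matrix as
    (packed bitmask, nonzero values). Not the DeepSparse / nm-vllm code."""
    mask = w != 0                              # boolean occupancy mask
    packed = np.packbits(mask.ravel())         # 1 bit per weight position
    values = w[mask]                           # nonzeros in row-major order
    return packed, values, w.shape

def expand_bitmask(packed: np.ndarray, values: np.ndarray, shape):
    """Reconstruct the dense matrix from the packed bitmask and values."""
    n = shape[0] * shape[1]
    mask = np.unpackbits(packed, count=n).astype(bool).reshape(shape)
    dense = np.zeros(shape, dtype=values.dtype)
    dense[mask] = values                       # scatter nonzeros back in place
    return dense

# Example: a roughly 70%-sparse weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w[rng.random(w.shape) < 0.7] = 0.0             # zero out ~70% of the weights

packed, values, shape = compress_bitmask(w)
w_restored = expand_bitmask(packed, values, shape)
assert np.array_equal(w, w_restored)

print(f"dense: {w.nbytes} B, bitmask + nonzeros: {packed.nbytes + values.nbytes} B")
```

At 70% sparsity the bitmask costs one bit per position and the value list shrinks to roughly 30% of the original weights, which is why compressed storage plus on-the-fly expansion can reduce memory traffic during inference.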
The combination of sparsity and quantization leads to significant performance gains, enabling the creation of smaller, faster, and more accessible LLMs without sacrificing accuracy. The work highlights the potential of sparse and quantized LLMs for efficient deployment and broader accessibility.
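To illustrate why sparsity and quantization compose well for model footprint, the toy sketch below (an assumption: symmetric per-tensor INT8 quantization applied only to the surviving nonzero weights; this is not the paper's quantization recipe) shows storage shrinking both because fewer weights are kept and because each kept weight needs fewer bytes.

```python
import numpy as np

def quantize_nonzeros_int8(w: np.ndarray):
    """Sketch: symmetric per-tensor INT8 quantization of only the surviving
    (nonzero) weights of a pruned matrix. Not the paper's actual recipe."""
    mask = w != 0
    nonzeros = w[mask]
    scale = np.abs(nonzeros).max() / 127.0
    q = np.clip(np.round(nonzeros / scale), -127, 127).astype(np.int8)
    return mask, q, scale

def dequantize_nonzeros(mask, q, scale):
    """Rebuild a dense float32 matrix from the mask and INT8 nonzeros."""
    dense = np.zeros(mask.shape, dtype=np.float32)
    dense[mask] = q.astype(np.float32) * scale
    return dense

rng = np.random.default_rng(1)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
w[rng.random(w.shape) < 0.7] = 0.0            # ~70% unstructured sparsity

mask, q, scale = quantize_nonzeros_int8(w)
w_hat = dequantize_nonzeros(mask, q, scale)

# Storage: 1 bit per position for the mask + 1 byte per surviving weight,
# versus 4 bytes per position for dense FP32.
stored = mask.size // 8 + q.nbytes + 4
print(f"dense fp32: {w.nbytes / 1e6:.1f} MB  ->  sparse INT8: {stored / 1e6:.1f} MB")
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

The runtime speedups reported in the paper come from engine-level kernels that exploit such compressed representations (less compute and less memory traffic), not from a Python-level transformation like this one.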