2024 | Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
SliceGPT is a post-training sparsification method that shrinks large language models (LLMs) by deleting rows and columns of their weight matrices, thereby reducing the embedding dimension, while maintaining high performance on both generation and downstream tasks. It removes up to 25% of the parameters of LLAMA-2 70B, OPT 66B, and Phi-2 while retaining 99%, 99%, and 90% of the dense models' zero-shot task performance, respectively. The method leverages computational invariance in transformer networks: hidden states can be rotated by an orthogonal matrix without changing the model's output, which lets the least important directions be sliced away. Because the embedding dimension genuinely shrinks, sliced models run on fewer GPUs and faster than their dense counterparts without any additional code optimization, with inference costs reduced to 64% and 66% of the dense model's on 24GB and 40GB GPUs, respectively. SliceGPT applies to a range of LLMs, including OPT, LLAMA-2, and Phi-2, and its accuracy can be further improved with recovery fine-tuning. SliceGPT offers a new approach to model compression, enabling more efficient deployment of large language models.
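The core mechanics can be sketched in a few lines of NumPy. This is illustrative only: SliceGPT derives the orthogonal matrix Q from a PCA of calibration activations so that the trailing directions carry little signal, whereas here Q is a random orthogonal matrix and the dimensions are toy values, so the sketch shows the invariance and slicing steps but not the accuracy of the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_sliced = 8, 6  # toy embedding dims: original and after slicing

# Hypothetical weight matrix and a small batch of hidden states.
W = rng.normal(size=(d, d))
X = rng.normal(size=(4, d))

# Computational invariance: for any orthogonal Q, rotating the hidden
# states (X @ Q) and counter-rotating the weights (Q.T @ W) leaves the
# layer output exactly unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose(X @ W, (X @ Q) @ (Q.T @ W))

# Slicing: delete the trailing rows of the rotated weights and the
# matching columns of the rotated states. In SliceGPT, Q comes from a
# PCA so these deleted directions are the least important ones; with a
# random Q this only demonstrates the shapes, not the quality.
W_sliced = (Q.T @ W)[:d_sliced, :]   # shape (6, 8): rows deleted
X_sliced = (X @ Q)[:, :d_sliced]     # shape (4, 6): columns deleted
Y_approx = X_sliced @ W_sliced       # smaller matmul approximating X @ W
```

Note that the embedding dimension really drops from 8 to 6, which is why the sliced model needs no custom sparse kernels: the matrices are simply smaller and dense.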