9 Feb 2024 | Saleh Ashkboos†*, Maximilian L. Croci†, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
SliceGPT is a novel post-training sparsification technique designed to reduce the computational and memory requirements of large language models (LLMs). Unlike existing methods that require additional data structures and offer limited speedup, SliceGPT replaces each weight matrix with a smaller dense matrix, reducing the embedding dimension of the network. This approach maintains high zero-shot task performance while significantly reducing the number of parameters and computational resources. Experiments on Llama-2 70B, OPT 66B, and Phi-2 models show that SliceGPT can remove up to 25% of parameters while maintaining 99%, 99%, and 90% of the dense model's zero-shot task performance, respectively. SliceGPT also reduces inference time and GPU usage, achieving up to 1.55× and 1.87× throughput improvements on 40GB A100 GPUs and 24GB RTX6000 GPUs, respectively. The method leverages computational invariance in transformer networks, allowing for efficient pruning without affecting model performance. SliceGPT offers a promising approach to reducing the computational and memory demands of pre-trained models, with potential applications in various downstream tasks.
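To make the two core ideas concrete, here is a minimal NumPy sketch of how computational invariance enables slicing. It is an illustration under simplifying assumptions, not the paper's implementation: a single linear layer stands in for a transformer block, the dimensions and synthetic data are arbitrary, and the orthogonal matrix is chosen via PCA on the activations (the rotation can be absorbed into the weights, so keeping only the leading principal directions yields smaller dense matrices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_small, n = 64, 48, 256      # embedding dim, sliced dim, number of tokens

# Synthetic activations with low effective rank (as observed in real LLMs),
# entering a linear layer with weight matrix W.
X = rng.normal(size=(n, 32)) @ rng.normal(size=(32, d))
X += 0.01 * rng.normal(size=(n, d))   # small noise
W = rng.normal(size=(d, d))

# Computational invariance: for any orthogonal Q, rotating the activations
# and counter-rotating the weights leaves the layer output unchanged.
Q_rand, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose(X @ W, (X @ Q_rand) @ (Q_rand.T @ W))

# PCA-style choice of Q: eigenvectors of the activation second-moment matrix,
# sorted so most of the signal energy lies in the leading columns.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
Q = eigvecs[:, ::-1]                  # descending eigenvalue order

# Slice: keep only the top d_small directions -> smaller dense matrices.
X_sliced = X @ Q[:, :d_small]         # (n, d_small)
W_sliced = Q[:, :d_small].T @ W       # (d_small, d)

rel_err = np.linalg.norm(X @ W - X_sliced @ W_sliced) / np.linalg.norm(X @ W)
print(f"relative output error after slicing: {rel_err:.4f}")
```

Because the activations concentrate in a low-dimensional subspace, discarding the trailing directions changes the output very little, while every weight matrix shrinks by the slicing fraction with no sparse data structures required.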