This paper introduces BESA, a novel parameter-efficient sparsity allocation technique for pruning large language models (LLMs). BESA addresses the limitations of existing layer-wise pruning methods through a block-wise reconstruction loss, which enables more efficient and effective pruning. Unlike traditional layer-wise approaches, BESA targets the overall pruning error of each transformer block and allocates layer-specific sparsity in a differentiable manner, reducing performance degradation after pruning. The method is characterized by two key attributes: (1) it minimizes the block-wise reconstruction error, and (2) it optimizes sparsity rates across layers in a differentiable manner.
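To make these two attributes concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of differentiable sparsity allocation trained against a block-wise reconstruction loss. Each linear layer learns a distribution over a few candidate sparsity ratios, and the corresponding magnitude-pruning masks are blended softly so gradients reach the allocation logits. The candidate ratios, the penalty weight, and all class and function names are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate per-layer sparsity ratios (an assumption, not the paper's exact set).
CANDIDATES = torch.tensor([0.3, 0.4, 0.5, 0.6, 0.7])

class SparsityAllocatedLinear(nn.Module):
    """Linear layer with a learnable distribution over candidate sparsity ratios."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = linear.bias
        # One logit per candidate ratio: the only trained parameters per layer.
        self.logits = nn.Parameter(torch.zeros(len(CANDIDATES)))
        # Precompute a binary magnitude-pruning mask for every candidate ratio.
        flat = self.weight.abs().flatten()
        order = flat.argsort()                       # ascending importance
        masks = []
        for s in CANDIDATES:
            m = torch.ones_like(flat)
            m[order[: int(s * flat.numel())]] = 0.0  # drop the s smallest weights
            masks.append(m.view_as(self.weight))
        self.register_buffer("masks", torch.stack(masks))

    def forward(self, x):
        probs = F.softmax(self.logits, dim=0)
        # Soft mixture of candidate masks keeps the operation differentiable.
        mask = (probs.view(-1, 1, 1) * self.masks).sum(0)
        return F.linear(x, self.weight * mask, self.bias)

    def expected_sparsity(self):
        return (F.softmax(self.logits, dim=0) * CANDIDATES.to(self.logits.device)).sum()

def block_loss(pruned_block, dense_out, calib_x, target_sparsity=0.5, lam=1.0):
    """Block-wise reconstruction error plus a penalty steering average sparsity to the target."""
    recon = F.mse_loss(pruned_block(calib_x), dense_out)
    sp = torch.stack([m.expected_sparsity()
                      for m in pruned_block.modules()
                      if isinstance(m, SparsityAllocatedLinear)]).mean()
    return recon + lam * (sp - target_sparsity) ** 2
```

In this sketch the dense weights stay frozen; only a handful of logits per layer are optimized, which is what makes the allocation parameter-efficient. `dense_out` is the output of the original dense block on calibration inputs, so each block is reconstructed independently of the rest of the model.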
BESA achieves state-of-the-art performance in pruning LLMs such as LLaMA1 and LLaMA2, spanning 7B to 70B parameters, on a single A100 GPU in just five hours. The method is parameter-efficient and easy to optimize, making it both efficient and effective across a range of LLMs. For example, BESA can prune 50% of the parameters of LLaMA2-70B within five hours on a single A100-80GB GPU while improving WikiText2 perplexity by 0.16 over SparseGPT.
The paper also presents a comprehensive LLM compression framework in which weight pruning and quantization are jointly optimized in a differentiable manner. Extensive experiments show that BESA achieves state-of-the-art performance in pruning various LLMs such as LLaMA1 and LLaMA2, evaluated on a range of language modeling and downstream tasks, surpassing prior art. Finally, the paper demonstrates the practical speedup of the pruned model in a hardware simulator.
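The following is a minimal sketch of how pruning and quantization can be optimized jointly under one block-wise loss, assuming a 4-bit uniform fake quantizer with a learnable step size and straight-through rounding; the paper's exact quantizer may differ, and the `soft_mask` argument stands in for the sparsity allocation mask from the sketch above. All names here are illustrative assumptions.

```python
import torch

def round_ste(x):
    # Round in the forward pass, identity gradient in the backward pass.
    return x + (torch.round(x) - x).detach()

def fake_quant(w, step, n_bits=4):
    # Uniform symmetric fake quantization with a learnable step size.
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(round_ste(w / step), -qmax - 1, qmax)
    return q * step

def pruned_quantized_weight(weight, soft_mask, step):
    # Prune first (soft mask from the sparsity allocation), then fake-quantize,
    # so a single block-wise reconstruction loss updates both the allocation
    # logits (through soft_mask) and the quantization step size.
    return fake_quant(weight * soft_mask, step)
```

Because both the mask and the quantizer are differentiable surrogates, one backward pass through the block reconstruction loss can update the sparsity allocation and the quantization parameters together.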
The key contributions of this work include: (1) the proposal of a model pruning framework named BESA for compressing LLMs, which searches for optimal pruning rates for each layer in a differentiable manner; (2) the parameter efficiency and effectiveness of BESA in pruning various LLMs such as LLaMA1 and LLaMA2; and (3) the establishment of new state-of-the-art performance in pruning LLMs on various language modeling tasks and downstream tasks.