BESA: PRUNING LARGE LANGUAGE MODELS WITH BLOCKWISE PARAMETER-EFFICIENT SPARSITY ALLOCATION

19 Apr 2024 | Peng Xu1,2 Wenqi Shao*2 Mengzhao Chen2 Shitao Tang4 Kaipeng Zhang2 Peng Gao2 Fengwei An3 Yu Qiao2 Ping Luo*1,2
The paper introduces a novel pruning technique for large language models (LLMs) called Blockwise Parameter-Efficient Sparsity Allocation (BESA). Unlike traditional layer-wise pruning methods, BESA targets the overall pruning error with respect to individual transformer blocks and allocates layer-specific sparsity in a differentiable manner, which reduces performance degradation after pruning. BESA has two key attributes: i) it minimizes the overall pruning error for each transformer block, and ii) it optimizes layer-wise sparsity allocation using differentiable binary masks. Experiments show that BESA achieves state-of-the-art performance, pruning LLMs such as LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. The method is parameter-efficient and easy to optimize, making it practical for deploying pruned LLMs in various applications.
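To make the two attributes concrete, below is a minimal PyTorch sketch of the idea: each linear layer in a transformer block is wrapped with a learnable sparsity ratio and a differentiable (straight-through) binary mask, and the masks are trained to minimize the block's output reconstruction error. All names here (MaskedLinear, sparsity_logit, block_pruning_loss, the magnitude-based importance score) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of blockwise differentiable sparsity allocation (assumed names,
# not the authors' code). Pretrained weights stay frozen; only a scalar
# sparsity logit per layer is learned.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Wraps a frozen linear layer with a differentiable pruning mask."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)               # keep pretrained weights fixed
        # One learnable scalar controls this layer's sparsity ratio.
        self.sparsity_logit = nn.Parameter(torch.tensor(0.0))
        # Magnitude-based importance; other scores (e.g. Wanda-style) also fit.
        self.register_buffer("importance", linear.weight.abs())

    def forward(self, x):
        ratio = torch.sigmoid(self.sparsity_logit)   # fraction of weights pruned
        k = int(ratio.item() * self.importance.numel())
        if k > 0:
            thresh = self.importance.flatten().kthvalue(k).values
            hard_mask = (self.importance > thresh).float()
        else:
            hard_mask = torch.ones_like(self.importance)
        # Straight-through estimator: binary mask in the forward pass,
        # soft (sigmoid) mask supplies the gradient to sparsity_logit.
        soft_mask = torch.sigmoid(self.importance - ratio * self.importance.mean())
        mask = hard_mask + soft_mask - soft_mask.detach()
        return F.linear(x, self.linear.weight * mask, self.linear.bias)

def block_pruning_loss(dense_block, masked_block, calib_inputs, target_sparsity=0.5):
    """Reconstruction error of one transformer block plus a sparsity penalty."""
    with torch.no_grad():
        ref = dense_block(calib_inputs)               # dense block output as target
    out = masked_block(calib_inputs)
    recon = F.mse_loss(out, ref)
    ratios = torch.stack([torch.sigmoid(m.sparsity_logit)
                          for m in masked_block.modules()
                          if isinstance(m, MaskedLinear)])
    # Penalty pushes the average per-layer sparsity toward the global target.
    return recon + (ratios.mean() - target_sparsity).pow(2)
```

Only the scalar sparsity logits are trained while the weights stay frozen, and each block is fitted against its own dense output on a small calibration set; this is what makes the scheme parameter-efficient and cheap enough to optimize block by block on a single GPU.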