The paper introduces a novel pruning technique for large language models (LLMs) called Blockwise Parameter-Efficient Sparsity Allocation (BESA). Unlike traditional layer-wise pruning methods, BESA targets the overall pruning error of each individual transformer block and allocates layer-specific sparsity in a differentiable manner, which reduces performance degradation after pruning. BESA is characterized by two key attributes: i) it minimizes the overall pruning error for each transformer block, and ii) it optimizes layer-wise sparsity allocation using differentiable binary masks. Experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs such as LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in about five hours. The method is parameter-efficient and easy to optimize, making it highly effective and practical for deploying LLMs in various applications.
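To make the two key attributes concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the general idea: a per-layer module mixes binary masks at several candidate sparsity levels with softmax weights, and the block-wise reconstruction error plus a sparsity-budget penalty are minimized by gradient descent over those mixing weights. The `SparsityAllocator` class, the candidate-ratio parameterization, the budget penalty, and the toy two-layer "block" are all illustrative assumptions rather than BESA's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class SparsityAllocator(nn.Module):
    """Mixes precomputed binary masks at several candidate sparsity levels with
    softmax weights, so the block reconstruction loss can choose a per-layer
    sparsity through ordinary gradient descent (an assumed parameterization)."""
    def __init__(self, weight: torch.Tensor, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
        super().__init__()
        weight = weight.detach()
        self.register_buffer("cands", torch.tensor(candidates))
        self.logits = nn.Parameter(torch.zeros(len(candidates)))  # learnable mixing weights
        importance = weight.abs().flatten()                        # |W| as a simple saliency proxy
        masks = []
        for s in candidates:
            k = max(1, int(s * importance.numel()))
            thr = torch.kthvalue(importance, k).values
            masks.append((weight.abs() > thr).float())             # keep-mask at sparsity s
        self.register_buffer("masks", torch.stack(masks))          # shape (C, out, in)

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.logits, dim=0)                  # (C,)
        mixed = (probs.view(-1, 1, 1) * self.masks).sum(dim=0)     # soft, differentiable mask
        return weight * mixed

    def expected_sparsity(self) -> torch.Tensor:
        return (torch.softmax(self.logits, dim=0) * self.cands).sum()

# A toy two-layer "transformer block" standing in for its attention/MLP projections.
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
calib = torch.randn(256, 64)                                       # calibration activations
with torch.no_grad():
    dense_out = block(calib)                                       # dense block output to reconstruct

allocators = [SparsityAllocator(block[0].weight), SparsityAllocator(block[2].weight)]
opt = torch.optim.Adam([a.logits for a in allocators], lr=5e-2)
budget = 0.5                                                       # target average block sparsity

for step in range(300):
    w0 = allocators[0](block[0].weight)
    w2 = allocators[1](block[2].weight)
    h = F.gelu(F.linear(calib, w0, block[0].bias))
    pruned_out = F.linear(h, w2, block[2].bias)
    recon = F.mse_loss(pruned_out, dense_out)                      # block-wise pruning error
    mean_s = torch.stack([a.expected_sparsity() for a in allocators]).mean()
    loss = recon + 10.0 * (mean_s - budget) ** 2                   # stay near the sparsity budget
    opt.zero_grad(); loss.backward(); opt.step()

# Discretize: each layer keeps the candidate sparsity its softmax weights favor most.
print([float(a.cands[a.logits.argmax()]) for a in allocators])
```

Because each transformer block is reconstructed independently against a small calibration set, only the lightweight mask parameters are optimized per block, which is what keeps this style of approach parameter-efficient and fast enough to prune very large models on a single GPU.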