ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models


7 Jan 2025 | Chenyang Song, Xu Han*, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun
ProSparse is a method introduced in the paper "ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models" that enhances activation sparsity in large language models (LLMs) without compromising performance. It addresses the computational cost of LLM inference by exploiting intrinsic activation sparsity: the presence of weakly contributing elements in activation outputs that can be skipped during inference to save computation.

The key contributions of ProSparse are:

1. **Effective ReLUfication**: ProSparse converts non-ReLU LLMs into ReLU-activated models with high activation sparsity.
2. **Comparable performance**: The sparsely activated models obtained by ProSparse match their original Swish-activated versions on various benchmarks.
3. **Inference acceleration**: ProSparse demonstrates significant inference acceleration, with speedups up to 4.52× using approximate algorithms and up to 2.44× and 1.70× using accurate GPU operators.

The method consists of three main steps (a hedged code sketch follows this list):

1. **Activation function substitution**: Replace the activation function with ReLU and continue training.
2. **Progressive sparsity regularization**: Apply $L_1$ regularization to the intermediate activation outputs with a gradually increasing regularization factor, avoiding radical shifts in activation distributions.
3. **Activation threshold shifting**: Shift the ReLU activation threshold to further enhance sparsity.
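The sketch below is a minimal, illustrative rendering of these ideas rather than the authors' implementation. It assumes a LLaMA-style gated FFN; the class name `ShiftedReLUFFN`, the helper `l1_regularization`, the cached `_last_activation` attribute, the threshold value, and the linear ramp (standing in for the paper's progressive schedule) are all assumptions introduced here for clarity.

```python
# Illustrative sketch of ProSparse-style components (not the authors' code):
# a gated FFN whose activation is replaced by a threshold-shifted ReLU
# (steps 1 and 3), plus an L1 penalty whose factor grows over training as a
# simple stand-in for the paper's progressive regularization (step 2).
import torch
import torch.nn as nn


class ShiftedReLUFFN(nn.Module):
    """Gated feed-forward block with a threshold-shifted ReLU activation."""

    def __init__(self, hidden_size: int, intermediate_size: int, threshold: float = 0.0):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.threshold = threshold  # 0.0 = plain ReLU; > 0 prunes more weak activations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.gate_proj(x)
        # Threshold-shifted ReLU: zero out activations that do not exceed the threshold.
        act = torch.where(gate > self.threshold, gate, torch.zeros_like(gate))
        self._last_activation = act          # cached for the sparsity regularizer
        inter = act * self.up_proj(x)        # intermediate activation output
        return self.down_proj(inter)


def l1_regularization(ffn: ShiftedReLUFFN, step: int, total_steps: int,
                      peak_factor: float = 1e-5) -> torch.Tensor:
    """L1 penalty on the cached activation with a linearly growing factor.

    The linear ramp is only an illustrative substitute for ProSparse's
    progressive, stage-wise regularization schedule.
    """
    factor = peak_factor * min(step / max(total_steps, 1), 1.0)
    return factor * ffn._last_activation.abs().mean()
```

In training, the returned penalty would simply be added to the language-modeling loss at each step, so the pressure toward sparsity increases smoothly rather than being applied at full strength from the start.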
Experiments on LLaMA2-7B, LLaMA2-13B, and MiniCPM-1B show that ProSparse achieves high sparsity (89.32%, 88.80%, and 87.89%, respectively) while maintaining comparable performance. The practical inference acceleration results using both approximate and accurate algorithms further validate the effectiveness of ProSparse. The paper also discusses the impact of $L_1$ regularization, the controllability of sparsity, and the distribution of sparsity across different datasets and layers.
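For reference, a sparsity figure of the kind quoted above can be understood as the fraction of exactly-zero entries in the intermediate activations, averaged over tokens. The helper below is a hypothetical illustration of that measurement, not taken from the paper.

```python
# Hypothetical helper: measure activation sparsity as the fraction of
# exactly-zero elements in an activation tensor.
import torch


def activation_sparsity(activations: torch.Tensor) -> float:
    """Return the fraction of zero-valued elements in the given tensor."""
    return (activations == 0).float().mean().item()


# Example with the ShiftedReLUFFN sketch above (shapes are illustrative):
# ffn = ShiftedReLUFFN(hidden_size=4096, intermediate_size=11008, threshold=0.01)
# _ = ffn(torch.randn(1, 16, 4096))
# print(f"sparsity: {activation_sparsity(ffn._last_activation):.2%}")
```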