ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

7 Jan 2025 | Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun
ProSparse is a method for introducing and enhancing intrinsic activation sparsity in large language models (LLMs) without performance degradation. It involves three steps: activation function substitution, progressive sparsity regularization, and activation threshold shifting. Activation function substitution replaces the original activation function with ReLU, which naturally outputs zero elements. Progressive sparsity regularization gradually increases the regularization factor to enhance activation sparsity while avoiding drastic shifts in the activation distribution. Activation threshold shifting raises the ReLU activation threshold to a positive value, pruning less influential neurons to improve sparsity further.

ProSparse achieves high activation sparsity on LLaMA2-7B (89.32%), LLaMA2-13B (88.80%), and MiniCPM-1B (87.89%), with performance comparable to their original Swish-activated versions. These are among the most sparsely activated open-source LLaMA variants and competitive end-size models.

Inference acceleration experiments show that higher activation sparsity significantly improves inference speed, with up to a 4.52× speedup. ProSparse demonstrates practical acceleration with both approximate and accurate algorithms. Approximate algorithms, such as PowerInfer, achieve high speedup ratios but can suffer inaccuracies from activation predictor errors; accurate algorithms, such as sparse GPU operators, exploit input-side and output-side sparsity for efficient inference without approximation. In both settings, higher activation sparsity yields better inference speed and, for approximate algorithms, better predictor accuracy.

Analysis shows that the sparsity of ProSparse models is mainly determined by the final-stage regularization factor, and that progressive sparsity regularization is essential for maintaining performance while sparsity is introduced.
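The three steps can be illustrated with a minimal numpy sketch. This is an assumed simplification, not the paper's implementation: `fat_relu` stands in for the shifted-threshold ReLU of step (3), `activation_sparsity` measures the fraction of zeroed activations, and `progressive_l1_penalty` uses a simple linear ramp for the regularization factor, whereas the actual schedule in the paper may differ.

```python
import numpy as np

def fat_relu(x, threshold=0.0):
    # Step 3 (activation threshold shifting): zero out activations below
    # a positive threshold, pruning weakly activated neurons.
    return np.where(x > threshold, x, 0.0)

def activation_sparsity(acts):
    # Fraction of zero entries in the activation tensor.
    return float(np.mean(acts == 0.0))

def progressive_l1_penalty(acts, step, total_steps, lambda_final=1e-4):
    # Step 2 (progressive sparsity regularization): an L1 penalty on FFN
    # activations whose factor ramps up over training. A linear ramp is
    # assumed here purely for illustration.
    lam = lambda_final * min(step / total_steps, 1.0)
    return lam * np.abs(acts).sum()

# Hypothetical hidden states: ReLU substitution (step 1) means zeros
# arise naturally, and the shifted threshold prunes a bit more.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))
acts = fat_relu(h, threshold=0.1)
print(activation_sparsity(acts))
```

Raising the threshold trades a small amount of activation mass for extra sparsity, which is why the paper pairs it with regularization rather than applying it alone.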
The sparsity distribution varies across datasets and layers, with higher sparsity on more formatted instruction-tuning datasets and in higher layers. ProSparse is a highly controllable method for sparsity adjustment, able to reach high sparsity without performance degradation. Future work will explore sparsity in attention layers and optimization of FFN step (1).
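How accurate algorithms exploit this sparsity can be sketched in numpy. This is a hypothetical simplification of the sparse GPU operators mentioned above: after the ReLU produces zeros, the down projection only needs the rows of its weight matrix that correspond to active neurons (input-side sparsity of the down projection; the symmetric output-side savings on the up projection are omitted for brevity).

```python
import numpy as np

def sparse_ffn(x, W_up, W_down):
    # Dense up projection, then ReLU; zeros mark inactive neurons.
    a = np.maximum(x @ W_up, 0.0)
    # Restrict the down projection to active neurons only: rows of
    # W_down for zeroed activations contribute nothing and are skipped.
    idx = np.nonzero(a > 0.0)[0]
    return a[idx] @ W_down[idx, :]

rng = np.random.default_rng(1)
d, m = 8, 32
x = rng.normal(size=d)
W_up = rng.normal(size=(d, m))
W_down = rng.normal(size=(m, d))

# The sparse pass is exact: it matches the dense computation.
dense = np.maximum(x @ W_up, 0.0) @ W_down
assert np.allclose(sparse_ffn(x, W_up, W_down), dense)
```

The higher the activation sparsity, the fewer rows of `W_down` are touched, which is the source of the speedups reported above; approximate methods like PowerInfer instead predict `idx` ahead of time, trading exactness for further gains.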