ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs

2024-02-06 | Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
This paper proposes ReLU² (squared ReLU) as the activation function of choice for sparse Large Language Models (LLMs), showing that it outperforms commonly used alternatives. The authors propose a general activation definition based on neuron output magnitudes, together with a tailored magnitude threshold, and demonstrate that non-ReLU LLMs also exhibit sparse activation. They conduct extensive experiments on LLMs trained with different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU², and evaluate them from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and hardware affinity.

The results show that ReLU² achieves the best balance between performance and sparsity, with performance degradation of less than 0.1% at a sparsity ratio close to 90%. Models using ReLU² also exhibit higher neuron activation predictivity and better hardware affinity than models using the other activation functions.

Beyond this comparison, the authors provide a systematic framework for examining sparse computation and propose a threshold-finding method based on the cumulative errors of tail truncation, which quantifies the impact of long-tailed neuron outputs on FFN computation. Overall, the findings suggest that ReLU² is a promising activation function for sparse LLMs, enabling more efficient inference while preserving performance.
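To make the core ideas concrete, below is a minimal NumPy sketch (not from the paper's codebase) of the squared-ReLU activation, a magnitude-threshold sparsity measure, and a rough truncation-error check in the spirit of the cumulative-errors-of-tail-truncation criterion. The array sizes, the threshold value, and the exact error formula are illustrative assumptions.

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: ReLU^2(x) = (max(0, x))**2."""
    return np.maximum(x, 0.0) ** 2

def sparsity_ratio(neuron_outputs: np.ndarray, threshold: float) -> float:
    """Fraction of neuron outputs whose magnitude is at or below the threshold.

    Under a magnitude-based activation definition, these neurons are treated
    as inactive and their contribution to the FFN output can be skipped.
    """
    return float(np.mean(np.abs(neuron_outputs) <= threshold))

def tail_truncation_error(neuron_outputs: np.ndarray,
                          w_down: np.ndarray,
                          threshold: float) -> float:
    """Relative error from truncating low-magnitude neuron outputs.

    Measures how much of the FFN output (neuron_outputs @ w_down) is lost when
    neurons with |output| <= threshold are zeroed out. This mirrors the idea of
    the paper's tail-truncation criterion, not its exact formulation.
    """
    full = neuron_outputs @ w_down
    truncated = np.where(np.abs(neuron_outputs) <= threshold,
                         neuron_outputs, 0.0) @ w_down
    return float(np.linalg.norm(truncated) / (np.linalg.norm(full) + 1e-12))

# Toy illustration with random data (sizes and threshold are illustrative).
rng = np.random.default_rng(0)
hidden = relu2(rng.standard_normal(4096))          # intermediate FFN neuron outputs
w_down = rng.standard_normal((4096, 1024)) / 64.0  # down-projection weights
t = 1e-2
print(f"sparsity at t={t}: {sparsity_ratio(hidden, t):.1%}, "
      f"truncation error: {tail_truncation_error(hidden, w_down, t):.4f}")
```

In this framing, a threshold could be chosen as the largest value for which the truncation error stays below a small tolerance, which captures the intuition behind the paper's threshold-finding method while leaving its exact procedure to the original text.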