6 Feb 2024 | Zhengyan Zhang*, Yixin Song*, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
The paper "ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs" explores the potential of sparse computation in Large Language Models (LLMs) to reduce inference costs in low-resource scenarios. Traditional approaches focus on ReLU-based LLMs, leveraging the zeros in activation values. However, the authors broaden the scope to include non-zero activation values, proposing a general method to define neuron activation based on output magnitudes and a magnitude threshold. They demonstrate that non-ReLU LLMs also exhibit sparse activation, which is more pronounced in ReLU-based models.
To find the most efficient activation function for sparse computation, the authors propose a systematic framework that evaluates LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and hardware affinity. Comprehensive experiments on LLMs using different activation functions (ReLU, SwiGLU, ReGLU, and ReLU²) show that models using ReLU² perform best across all three aspects: they achieve a higher sparsity ratio, better predictivity, and superior hardware affinity, leading to significant reductions in computational cost.
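For reference, ReLU² simply squares the ReLU output inside the feed-forward block, while SwiGLU gates one linear projection with another through SiLU (ReGLU uses the same gated structure with ReLU in place of SiLU). The PyTorch sketch below contrasts the two feed-forward variants; the module names and dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU2FFN(nn.Module):
    """Feed-forward block with the ReLU² activation: relu(x) ** 2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Non-positive pre-activations map to exactly zero, so the
        # intermediate representation is naturally sparse.
        return self.down(F.relu(self.up(x)) ** 2)

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block with SwiGLU: silu(gate(x)) * up(x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU rarely outputs exact zeros, so sparsity here must be defined
        # with the magnitude threshold described above.
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Quick shape check with assumed, illustrative dimensions.
x = torch.randn(2, 16, 512)
print(ReLU2FFN(512, 2048)(x).shape, SwiGLUFFN(512, 2048)(x).shape)
```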
The paper concludes by highlighting the potential of ReLU² as an efficient activation function for sparse LLMs and encourages further research in this area. The code for the experiments is made available to facilitate future studies.