11 Jun 2024 | Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen
TurboSparse is a method for achieving state-of-the-art performance in large language models (LLMs) with a minimal number of activated parameters. The paper introduces dReLU, a novel activation function that improves activation sparsity in LLMs, and pairs it with a high-quality training data mixture ratio to make sparsification effective. It also discusses the limitations of existing ReLUfication methods and proposes dReLU-based sparsification to address them. Beyond dense models, the method exploits the sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models, demonstrating that the sparsity phenomenon persists in MoE models and that ReLUfication can be extended to them.
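The summary does not spell out the dReLU formula, so the sketch below assumes the common reading that dReLU applies ReLU to both the gate and up projections of a GLU-style FFN (in place of SiLU on the gate), which makes an intermediate neuron exactly zero whenever either branch is negative. This is a minimal PyTorch illustration under that assumption; module names and sizes are illustrative, not the authors' code.

```python
# Minimal PyTorch sketch of a dReLU-style gated FFN block (assumption:
# ReLU is applied to BOTH the gate and up projections, so the element-wise
# product is exactly zero unless both pre-activations are positive).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DReLUFFN(nn.Module):
    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 14336):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A neuron is "active" only when both its gate and up branches are positive.
        act = F.relu(self.gate_proj(x)) * F.relu(self.up_proj(x))
        return self.down_proj(act)


if __name__ == "__main__":
    ffn = DReLUFFN(hidden_size=64, intermediate_size=256)
    x = torch.randn(2, 8, 64)  # (batch, seq_len, hidden)
    hidden = F.relu(ffn.gate_proj(x)) * F.relu(ffn.up_proj(x))
    print("output shape:", ffn.down_proj(hidden).shape)
    # With random weights the zero fraction is only ~75%; the ~90% sparsity
    # reported in the paper comes from continued training with dReLU.
    print("zero fraction:", (hidden == 0).float().mean().item())
```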
Applying this neuron sparsification method to the Mistral and Mixtral models activates only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, while achieving even more powerful model performance. The resulting models reach roughly 90% sparsity in the FFN without sacrificing quality, showing that LLMs with ReLU-based intrinsic activation sparsity can maintain the same or better performance while significantly reducing FLOPs. dReLU-based sparsified models, particularly TurboSparse-Mixtral-47B, consistently outperform comparable models on the Open LLM Leaderboard and on a range of downstream tasks. In practical terms, the sparsity yields a 2-5× decoding speedup, with a measured 2.83× generation speedup, and TurboSparse-Mixtral-47B reaches an inference speed of 11 tokens per second on mobile phones. The key contributions are the efficient dReLU activation function, the sparsely activated models (available at https://huggingface.co/PowerInfer), and the practical inference speedup. The paper also discusses the broader impact of the method, including its potential to reduce computational demands and make advanced AI technologies more accessible.
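To illustrate why high FFN sparsity translates into decoding speedups, here is a hedged single-token sketch that computes only the neurons assumed to be active: if roughly 90% of intermediate neurons are zero, only the corresponding rows of the gate/up weights and columns of the down weights are needed. In practice the active set would come from a learned activation predictor and custom sparse kernels (as in PowerInfer-style engines); the function name, shapes, and the random stand-in for the predictor below are assumptions for illustration only.

```python
# Sketch of a sparsity-aware FFN step for one decoded token: restrict the
# matrix-vector products to the neurons predicted to be active.
import torch
import torch.nn.functional as F


def sparse_ffn_step(x, w_gate, w_up, w_down, active_idx):
    """x: (hidden,); w_gate/w_up: (intermediate, hidden);
    w_down: (hidden, intermediate); active_idx: (k,) with k << intermediate."""
    g = F.relu(w_gate[active_idx] @ x)      # (k,) only active rows
    u = F.relu(w_up[active_idx] @ x)        # (k,)
    return w_down[:, active_idx] @ (g * u)  # (hidden,) only active columns


if __name__ == "__main__":
    hidden, inter, k = 64, 256, 26          # ~90% of neurons skipped
    x = torch.randn(hidden)
    w_gate, w_up = torch.randn(inter, hidden), torch.randn(inter, hidden)
    w_down = torch.randn(hidden, inter)
    idx = torch.randperm(inter)[:k]         # stand-in for a learned activation predictor
    print(sparse_ffn_step(x, w_gate, w_up, w_down, idx).shape)
```

Since decoding a single token is memory-bandwidth bound, skipping ~90% of the FFN weights is what makes the reported 2-5× speedups, and the 11 tokens per second on mobile phones, plausible.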