Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters


11 Jun 2024 | Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen
The paper "Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters" by Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen Li Ma, Zeyu Mi, and Haibo Chen introduces a novel method to enhance the efficiency of large language models (LLMs) by exploiting activation sparsity. The authors propose a new activation function called dReLU, which is designed to improve sparsity in LLMs while maintaining or even improving performance. They also leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. Key contributions of the paper include: 1. **Efficient dReLU Activation Function**: The dReLU function is designed to achieve higher sparsity without compromising performance. 2. **Sparse Activated Models**: The authors release sparsely-activated models, TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, which demonstrate better performance compared to their original counterparts. 3. **Practical Inference Speedup**: The models achieve a 2-5x speedup in decoding, with TurboSparse-Mixtral-47B achieving an inference speed of 11 tokens per second on mobile phones. The paper addresses the limitations of existing ReLUfication methods, which often struggle to achieve significant sparsity and may lead to performance degradation. By analyzing the activation distribution and proposing dReLU, the authors show that their method can achieve high sparsity (up to 90%) without sacrificing performance. Additionally, they demonstrate that sparse activation patterns in MoE models can further enhance efficiency. The evaluation results show that the proposed method not only improves performance but also significantly reduces computational resources, making LLMs more accessible and environmentally friendly. The paper concludes by highlighting the broader impact of their work, which can help democratize access to advanced AI technologies, particularly for smaller organizations and educational institutions.The paper "Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters" by Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen Li Ma, Zeyu Mi, and Haibo Chen introduces a novel method to enhance the efficiency of large language models (LLMs) by exploiting activation sparsity. The authors propose a new activation function called dReLU, which is designed to improve sparsity in LLMs while maintaining or even improving performance. They also leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. Key contributions of the paper include: 1. **Efficient dReLU Activation Function**: The dReLU function is designed to achieve higher sparsity without compromising performance. 2. **Sparse Activated Models**: The authors release sparsely-activated models, TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, which demonstrate better performance compared to their original counterparts. 3. **Practical Inference Speedup**: The models achieve a 2-5x speedup in decoding, with TurboSparse-Mixtral-47B achieving an inference speed of 11 tokens per second on mobile phones. The paper addresses the limitations of existing ReLUfication methods, which often struggle to achieve significant sparsity and may lead to performance degradation. By analyzing the activation distribution and proposing dReLU, the authors show that their method can achieve high sparsity (up to 90%) without sacrificing performance. 
Additionally, they demonstrate that sparse activation patterns in MoE models can further enhance efficiency. The evaluation results show that the proposed method not only improves performance but also significantly reduces computational resources, making LLMs more accessible and environmentally friendly. The paper concludes by highlighting the broader impact of their work, which can help democratize access to advanced AI technologies, particularly for smaller organizations and educational institutions.
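To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a dReLU-style FFN block. It assumes dReLU means applying ReLU to both the gate and up projections of a SwiGLU-style FFN, so the element-wise product is zero whenever either branch is inactive; the class name `DReLUFFN` and the helper `activation_sparsity` are illustrative, not taken from the paper's released code. The helper measures activation sparsity as the fraction of intermediate neurons that are exactly zero, which is what a sparsity-aware inference engine can exploit to skip computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of a dReLU-gated FFN block (assumption: dReLU applies
# ReLU to both the gate and up projections of a SwiGLU-style FFN).
class DReLUFFN(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU on both branches drives most intermediate activations to exactly 0.
        activated = F.relu(self.gate_proj(x)) * F.relu(self.up_proj(x))
        return self.down_proj(activated)


def activation_sparsity(ffn: DReLUFFN, x: torch.Tensor) -> float:
    """Fraction of intermediate neurons whose activation is exactly zero."""
    with torch.no_grad():
        activated = F.relu(ffn.gate_proj(x)) * F.relu(ffn.up_proj(x))
        return (activated == 0).float().mean().item()


if __name__ == "__main__":
    ffn = DReLUFFN(hidden_size=64, intermediate_size=256)
    tokens = torch.randn(8, 64)
    print(f"output shape: {ffn(tokens).shape}")
    print(f"activation sparsity: {activation_sparsity(ffn, tokens):.2%}")
```

With randomly initialized weights the measured sparsity is only moderate; the high sparsity the paper reports (up to 90%) comes from training the model with this activation, after which zero rows of the intermediate activation let the decoder skip the corresponding columns of the down projection entirely.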