Learn To be Efficient: Build Structured Sparsity in Large Language Models

3 Jun 2024 | Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, and Atul Prakash
This paper introduces Learn-To-be-Efficient (LTE), a training algorithm that teaches large language models (LLMs) to develop more structured activation sparsity, thereby improving inference efficiency. LLMs carry high computational and memory costs at inference time, but activation sparsity in feed-forward network (FFN) layers can substantially reduce this overhead. LTE trains models to activate fewer FFN neurons while preserving task performance.

LTE learns structured sparsity through three components: an efficiency loss penalty that encourages the model to activate fewer neurons in FFN layers, a threshold-based sigmoid routing strategy for selecting experts, and a two-stage training mechanism that improves training stability. Unlike prior methods that target ReLU-based models, LTE also applies to LLMs with non-ReLU activations, such as LLaMA, and it is paired with a hardware-aware custom kernel implementation that further reduces inference latency.

Extensive evaluations on language understanding, generation, and instruction-tuning tasks show that LTE consistently outperforms state-of-the-art baselines; for example, it reduces LLaMA2-7B inference latency by 25% at 50% sparsity. The method remains effective on LLMs with soft activation functions, demonstrating its versatility. The paper also discusses limitations of existing approaches, such as the difficulty of training routers and of selecting the right experts at serving time; LTE's two-stage training algorithm addresses these challenges by balancing inference efficiency against task performance.
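To make the routing idea concrete, here is a minimal sketch of threshold-based sigmoid routing with an efficiency loss penalty. This is an illustration under assumptions, not the paper's implementation: the function name `lte_ffn_forward`, the shapes, and the choice to treat each hidden neuron as its own expert are all hypothetical simplifications (in the paper, experts are groups of FFN neurons).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lte_ffn_forward(x, W_router, W_up, W_down, tau=0.5):
    """Hypothetical sketch of an LTE-style sparse FFN forward pass.

    x        : (d,)    input hidden state
    W_router : (d, h)  router weights producing a score per expert
    W_up     : (d, h)  FFN up-projection
    W_down   : (h, d)  FFN down-projection
    tau      : gating threshold; only experts with sigmoid score > tau fire
    """
    gate = sigmoid(x @ W_router)               # per-expert activation scores in (0, 1)
    mask = (gate > tau).astype(x.dtype)        # hard threshold: which experts activate
    hidden = np.maximum(x @ W_up, 0.0) * mask  # compute (and keep) only gated neurons
    out = hidden @ W_down
    # Efficiency loss: mean gate value penalizes activating many experts,
    # pushing the router toward structured sparsity during training.
    efficiency_loss = gate.mean()
    return out, efficiency_loss, mask
```

At serving time the mask makes whole expert blocks skippable, which is what the hardware-aware kernel exploits; during training, the efficiency loss is added (with some weight) to the task loss so the router learns to activate fewer experts.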
Overall, LTE provides a promising approach to enhance the efficiency of LLMs by learning structured activation sparsity, making them more accessible and efficient for a broader range of applications.