3 Jun 2024 | Haizhong Zheng1, Xiaoyan Bai1, Xueshen Liu1, Z. Morley Mao1, Beidi Chen2, Fan Lai3, and Atul Prakash1
This paper introduces a novel training algorithm, Learn-To-be-Efficient (LTE), that enhances the efficiency of large language models (LLMs) by inducing more structured activation sparsity. LTE trains LLMs to activate fewer neurons during inference, improving efficiency while preserving task performance. Unlike existing methods that exploit activation sparsity only after training, LTE learns to create structured sparsity during training, making it applicable to models with non-ReLU activations, such as LLaMA. The algorithm adds an efficiency loss penalty that encourages the model to activate fewer neurons while maintaining task performance, and employs a threshold-based Sigmoid routing strategy to adaptively select experts for different inputs and layers. Extensive evaluations on natural language understanding, generation, and instruction-tuning tasks show that LTE consistently outperforms state-of-the-art baselines, achieving up to a 2.59x FLOPs speed-up on LLaMA2-7B at 50% sparsity and reducing inference latency by 25%. The paper also presents a hardware-efficient implementation of LTE using a custom CUDA kernel, further improving inference speed.
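To make the abstract's two key ingredients concrete, below is a minimal PyTorch sketch of what a threshold-based Sigmoid router combined with an efficiency loss penalty could look like. This is an illustration based only on the description above, not the paper's actual implementation: the class and function names, the 0.5 threshold, the straight-through gradient trick, and the penalty weight are all assumptions.

```python
import torch
import torch.nn as nn


class ThresholdSigmoidRouter(nn.Module):
    """Sketch of a threshold-based Sigmoid router over FFN "experts"
    (groups of neurons). All names and defaults are assumptions."""

    def __init__(self, hidden_dim: int, num_experts: int, threshold: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        # Independent Sigmoid scores per expert (not a softmax), so each
        # token may activate any number of experts, including very few.
        scores = torch.sigmoid(self.gate(x))          # (batch, num_experts)
        # Hard threshold for the binary activation mask; a straight-through
        # estimator keeps gradients flowing to the gate during training.
        hard_mask = (scores > self.threshold).float()
        mask = hard_mask + scores - scores.detach()
        return mask, scores


def efficiency_loss(scores: torch.Tensor) -> torch.Tensor:
    # Penalize the expected fraction of activated experts, nudging the
    # model toward sparser activations while the task loss preserves accuracy.
    return scores.mean()


# Usage sketch: combine the task loss with the efficiency penalty.
if __name__ == "__main__":
    router = ThresholdSigmoidRouter(hidden_dim=64, num_experts=8)
    x = torch.randn(4, 64)
    mask, scores = router(x)
    task_loss = torch.tensor(0.0)  # placeholder for the model's task loss
    lam = 0.1                      # penalty weight (assumed hyperparameter)
    loss = task_loss + lam * efficiency_loss(scores)
    loss.backward()
```

The Sigmoid (rather than softmax) gating matches the abstract's point that the number of selected experts can vary per input and per layer; the penalty term is what trades activation count against task performance during training.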