Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

8 Apr 2024 | Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda
This paper introduces DS-MoE, a hybrid training and inference framework for Mixture-of-Experts (MoE) language models that combines dense training with sparse inference. MoE models are computationally efficient, but they require significantly more parameters than dense models, which increases memory usage and reduces efficiency in I/O-bounded scenarios. DS-MoE addresses this by activating all experts during training and then routing each token to only a subset of experts during inference. The resulting models are computationally cheaper and more parameter-efficient than standard sparse MoEs while maintaining performance comparable to dense models.

The framework has two key components: a Mutual Information (MI) loss that balances the workload across experts and prevents underutilization of model capacity, and a Mixture of Attention Heads (MoA) block that replaces the dense self-attention layer with a mixture of attention heads to improve efficiency.

Experiments show that DS-MoE models outperform traditional sparse MoEs in both parameter and computational efficiency, matching the performance of dense models while activating only 30-40% of the parameters during inference. DS-MoE models also deliver better throughput in both computation-bounded and I/O-bounded scenarios, achieving the highest throughput among the compared MoE models and making the approach a promising direction for efficient large language model deployment.
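To make the "dense training, sparse inference" idea concrete, below is a minimal PyTorch sketch (not the authors' code) of a single MoE feed-forward layer that weights every expert during training and keeps only the top-k experts per token at inference, together with a mutual-information-style load-balancing term. The class name, expert sizes, the choice of k, and the exact form of the balancing loss are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of dense-training / sparse-inference MoE routing.
# Assumptions: top-k gating at inference, GELU experts, and an
# MI-style balance loss of the form -H(mean gates) + mean H(per-token gates).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (B, S, E)

        if self.training:
            # Dense training: every expert processes every token,
            # weighted by its gate value.
            expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, S, E, D)
            out = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)
            return out, self._mi_balance_loss(gates)

        # Sparse inference: each token is processed only by its top-k experts.
        topk_vals, topk_idx = gates.topk(self.top_k, dim=-1)
        topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        flat_x = x.reshape(-1, x.shape[-1])
        flat_out = torch.zeros_like(flat_x)
        flat_idx = topk_idx.reshape(-1, self.top_k)
        flat_w = topk_vals.reshape(-1, self.top_k)
        for e_id, expert in enumerate(self.experts):
            token_mask = flat_idx == e_id                  # (N, k)
            rows = token_mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue
            w = (flat_w[rows] * token_mask[rows]).sum(dim=-1, keepdim=True)
            flat_out[rows] += w * expert(flat_x[rows])
        return flat_out.reshape_as(x), x.new_zeros(())

    @staticmethod
    def _mi_balance_loss(gates: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
        # Encourage a high-entropy *average* gate distribution (even expert load)
        # and a low-entropy *per-token* distribution (confident routing).
        p_mean = gates.mean(dim=(0, 1))                                    # (E,)
        h_marginal = -(p_mean * (p_mean + eps).log()).sum()
        h_conditional = -(gates * (gates + eps).log()).sum(-1).mean()
        return -h_marginal + h_conditional
```

The design choice this sketch highlights is that the router is trained with gradient signal from all experts (no token dropping or capacity limits), so at deployment time the same gates can be pruned to the top-k without retraining; only the balancing term keeps the per-expert load even enough for that pruning to be cheap.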