Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

8 Apr 2024 | Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda
This paper introduces DS-MoE, a hybrid training and inference framework for Mixture-of-Experts (MoE) language models that combines dense training with sparse inference. MoE models are computationally efficient, but they require significantly more parameters than dense models, which increases memory usage and reduces efficiency in I/O-bounded scenarios. DS-MoE addresses this by activating all experts during training and then routing each token to only a subset of experts during inference. The resulting models are computationally cheaper and more parameter-efficient than standard sparse MoEs while maintaining performance comparable to dense models.

The framework has two key components: a Mutual Information (MI) loss that balances the workload across experts and prevents underutilization of model capacity, and a Mixture of Attention Heads (MoA) block that replaces the dense self-attention layer with a mixture of attention heads to improve efficiency.

Experiments show that DS-MoE models outperform traditional sparse MoEs in both parameter and computational efficiency, matching the performance of dense models while activating only 30-40% of the parameters during inference. DS-MoE models also deliver better throughput in both computation-bounded and I/O-bounded scenarios, achieving the highest throughput among the compared MoE models and making the approach a promising direction for efficient large language model deployment.
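To make the "dense training, sparse inference" idea concrete, below is a minimal PyTorch sketch (not the authors' code) of a single MoE feed-forward layer that weights every expert during training and keeps only the top-k experts per token at inference, together with a mutual-information-style load-balancing term. The class name, expert sizes, the choice of k, and the exact form of the balancing loss are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of dense-training / sparse-inference MoE routing.
# Assumptions: top-k gating at inference, GELU experts, and an
# MI-style balance loss of the form -H(mean gates) + mean H(per-token gates).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (B, S, E)

        if self.training:
            # Dense training: every expert processes every token,
            # weighted by its gate value.
            expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, S, E, D)
            out = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)
            return out, self._mi_balance_loss(gates)

        # Sparse inference: each token is processed only by its top-k experts.
        topk_vals, topk_idx = gates.topk(self.top_k, dim=-1)
        topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        flat_x = x.reshape(-1, x.shape[-1])
        flat_out = torch.zeros_like(flat_x)
        flat_idx = topk_idx.reshape(-1, self.top_k)
        flat_w = topk_vals.reshape(-1, self.top_k)
        for e_id, expert in enumerate(self.experts):
            token_mask = flat_idx == e_id                  # (N, k)
            rows = token_mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if rows.numel() == 0:
                continue
            w = (flat_w[rows] * token_mask[rows]).sum(dim=-1, keepdim=True)
            flat_out[rows] += w * expert(flat_x[rows])
        return flat_out.reshape_as(x), x.new_zeros(())

    @staticmethod
    def _mi_balance_loss(gates: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
        # Encourage a high-entropy *average* gate distribution (even expert load)
        # and a low-entropy *per-token* distribution (confident routing).
        p_mean = gates.mean(dim=(0, 1))                                    # (E,)
        h_marginal = -(p_mean * (p_mean + eps).log()).sum()
        h_conditional = -(gates * (gates + eps).log()).sum(-1).mean()
        return -h_marginal + h_conditional
```

The design choice this sketch highlights is that the router is trained with gradient signal from all experts (no token dropping or capacity limits), so at deployment time the same gates can be pruned to the top-k without retraining; only the balancing term keeps the per-expert load even enough for that pruning to be cheap.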