Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

17 Jan 2024 | Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda
This paper introduces ExFlow, a lightweight optimization technique that accelerates the inference of Mixture-of-Experts (MoE) models in distributed systems. The key idea is to exploit inter-layer expert affinity: in pre-trained MoE models, tokens tend to be routed to correlated experts across consecutive layers. By leveraging this affinity, ExFlow reduces the Alltoall communication that is the major bottleneck in distributed MoE inference; whereas previous methods require two Alltoall operations per MoE layer, ExFlow needs only one (a toy hop-count comparison at the end of this summary illustrates the saving).

The authors first demonstrate that pre-trained GPT MoE models implicitly exhibit strong inter-layer expert affinity. They quantify it by estimating, from routing traces, the conditional probability that a token assigned to expert i at layer l is assigned to expert j at layer l+1. A minimal sketch of this measurement follows.
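The sketch below assumes top-1 gating and that per-token routing decisions at two adjacent layers have been logged (for example, by hooking the gate during a profiling pass); the function name and the trace format are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def interlayer_affinity(routes_l, routes_l1, num_experts):
    """Estimate P(expert j at layer l+1 | expert i at layer l).

    routes_l, routes_l1: integer arrays giving each token's expert id at
    layer l and layer l+1 (top-1 routing assumed for simplicity).
    Returns a (num_experts, num_experts) matrix whose rows are
    conditional distributions over next-layer experts.
    """
    counts = np.zeros((num_experts, num_experts))
    np.add.at(counts, (routes_l, routes_l1), 1)  # joint routing counts
    rows = counts.sum(axis=1, keepdims=True)
    # Normalize each row; experts never used at layer l keep an all-zero row.
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
```

A strongly diagonal or block-structured matrix signals exploitable affinity: most of a token's probability mass at layer l+1 concentrates on a few experts, which placement can then co-locate.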
These conditional probabilities feed an efficient integer programming model that chooses an expert placement, co-locating high-affinity experts on the same GPU (a toy version of such a placement model is sketched after this summary). The resulting placement reduces cross-GPU routing latency by up to 67% and improves inference throughput by up to 2.2x compared to existing systems such as DeepSpeed-MoE.

ExFlow is implemented in a context-coherent manner: each token performs its attention computation in place on its current GPU, eliminating the cross-GPU communication that would otherwise be needed to restore its context. The design is agnostic to the specific MoE model and can be applied to various pre-trained GPT MoE models without retraining, and the measured affinity is stable across datasets and hardware configurations, making ExFlow a robust solution for MoE inference.

Overall, ExFlow delivers significant improvements in both communication efficiency and inference speed, making it a promising approach for accelerating MoE models in distributed systems. The paper also examines how expert affinity emerges and evolves during training, underscoring its importance beyond inference.
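The paper's integer programming formulation is not reproduced here, so the following is a minimal sketch of one plausible version using the PuLP library, restricted to a single pair of adjacent layers, top-1 routing, and an equal number of experts per GPU. The helper and its variable names are assumptions; the paper's actual model is more general and more efficient than this quadratic-assignment linearization.

```python
import itertools
import pulp

def place_experts(affinity, num_gpus):
    """Assign experts to GPUs so that high-affinity pairs share a GPU.

    affinity: square matrix, affinity[i][j] ~ P(j at layer l+1 | i at layer l).
    Assumes the expert count divides evenly across GPUs.
    """
    n = len(affinity)
    experts, gpus = range(n), range(num_gpus)

    prob = pulp.LpProblem("expert_placement", pulp.LpMaximize)
    # x[e][g] = 1 iff expert e is placed on GPU g.
    x = pulp.LpVariable.dicts("x", (experts, gpus), cat="Binary")
    # y[i][j][g] = 1 iff experts i and j are co-located on GPU g.
    y = pulp.LpVariable.dicts("y", (experts, experts, gpus), cat="Binary")

    # Maximize the affinity mass that stays on a single GPU, i.e. token
    # transitions between layers that need no cross-GPU hop.
    prob += pulp.lpSum(float(affinity[i][j]) * y[i][j][g]
                       for i, j in itertools.permutations(experts, 2)
                       for g in gpus)

    for e in experts:   # every expert lives on exactly one GPU
        prob += pulp.lpSum(x[e][g] for g in gpus) == 1
    for g in gpus:      # balanced load: equal number of experts per GPU
        prob += pulp.lpSum(x[e][g] for e in experts) == n // num_gpus
    for i, j in itertools.permutations(experts, 2):
        for g in gpus:  # linearized AND: y can be 1 only if both x's are 1
            prob += y[i][j][g] <= x[i][g]
            prob += y[i][j][g] <= x[j][g]

    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    return {e: next(g for g in gpus if x[e][g].value() > 0.5) for e in experts}
```

Because the objective is maximized, the solver sets y[i][j][g] to 1 exactly when both experts sit on GPU g, so the objective counts the affinity kept local. The y variables grow quadratically with the expert count, which is fine for a sketch but motivates the paper's more efficient formulation.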
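Finally, to make the one-versus-two Alltoall saving concrete, here is a toy per-token hop count under two simplifying assumptions: the baseline sends each token to its expert's GPU and back every layer (dispatch plus combine), while context-coherent execution lets the token remain wherever its last expert ran. This counts token moves rather than actual Alltoall message volume, and all names are illustrative.

```python
def count_cross_gpu_moves(routes, placement, token_gpu, context_coherent):
    """Count cross-GPU token moves over the MoE layers of one forward pass.

    routes: routes[l][t] = expert chosen by token t at layer l (top-1).
    placement: expert id -> GPU id.
    token_gpu: initial GPU of each token.
    """
    moves, loc = 0, list(token_gpu)
    for layer in routes:
        for t, expert in enumerate(layer):
            dst = placement[expert]
            if dst != loc[t]:
                moves += 1        # dispatch hop to the expert's GPU
            if context_coherent:
                loc[t] = dst      # ExFlow-style: the token stays put
            elif dst != loc[t]:
                moves += 1        # combine hop back to the source GPU
    return moves

placement = {0: 0, 1: 0, 2: 1, 3: 1}   # hypothetical: 4 experts on 2 GPUs
routes = [[0, 2, 3], [1, 2, 3]]        # 2 MoE layers, 3 tokens
start = [0, 0, 1]
print(count_cross_gpu_moves(routes, placement, start, False))  # baseline: 4
print(count_cross_gpu_moves(routes, placement, start, True))   # ExFlow-style: 1
```

With an affinity-aware placement, consecutive experts tend to share a GPU, so the single remaining Alltoall also carries fewer tokens.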