Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

17 Jan 2024 | Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda
The paper "Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference" by Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K. Panda addresses the challenges of deploying GPT-MoE models for parallel inference on distributed systems, in particular the high communication overhead of the Alltoall operations used for expert routing and aggregation. The authors propose ExFlow, a lightweight optimization that exploits inter-layer expert affinity, the tendency of tokens routed to a given expert in one MoE layer to be routed to particular experts in the next layer, to reduce this communication overhead. Unlike previous methods, ExFlow can be applied directly to pre-trained MoE models without fine-tuning or accuracy degradation. By designing a context-coherent expert parallelism and solving an integer program that captures expert affinity to co-locate high-affinity experts on the same GPU, ExFlow reduces tokens' cross-GPU routing latency by up to 67%. The solution is evaluated on various hardware configurations and topologies, showing significant improvements in inference throughput over existing methods. The paper also provides insights into how expert affinity evolves during model training and its insensitivity to dataset distribution.
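To make the affinity idea concrete, here is a minimal, hypothetical Python sketch (not the authors' code): it estimates an inter-layer affinity matrix from per-token routing traces and then co-locates high-affinity experts using a greedy heuristic as a stand-in for the paper's integer program. All names (`routing_trace`, `greedy_placement`, the toy dimensions) are assumptions for illustration.

```python
import numpy as np

def affinity_matrix(routing_trace: np.ndarray, num_experts: int) -> np.ndarray:
    """Count how often a token sent to expert i in layer l is sent to
    expert j in layer l+1. routing_trace: [num_tokens, num_layers] array
    of the expert index chosen per token per layer (illustrative format)."""
    aff = np.zeros((num_experts, num_experts), dtype=np.int64)
    for layer in range(routing_trace.shape[1] - 1):
        for i, j in zip(routing_trace[:, layer], routing_trace[:, layer + 1]):
            aff[i, j] += 1
    return aff

def greedy_placement(aff: np.ndarray, num_gpus: int, capacity: int) -> dict:
    """Greedy stand-in for the paper's integer program: walk expert pairs
    from strongest to weakest affinity and co-locate them when possible."""
    sym = aff + aff.T                 # treat affinity as direction-agnostic
    n = aff.shape[0]
    placement, load = {}, [0] * num_gpus
    pairs = sorted(((sym[i, j], i, j) for i in range(n)
                    for j in range(i + 1, n)), reverse=True)
    for _, i, j in pairs:
        for e, partner in ((i, j), (j, i)):
            if e in placement:
                continue
            g = placement.get(partner)  # prefer the partner's GPU...
            if g is None or load[g] >= capacity:
                # ...otherwise fall back to the least-loaded GPU
                g = min(range(num_gpus), key=load.__getitem__)
            placement[e] = g
            load[g] += 1
    return placement

# Toy run: 8 experts per layer, 4 MoE layers, 1000 tokens, 2 GPUs.
rng = np.random.default_rng(0)
trace = rng.integers(0, 8, size=(1000, 4))
print(greedy_placement(affinity_matrix(trace, 8), num_gpus=2, capacity=4))
```

The paper itself formulates placement as an integer program over expert-to-GPU assignments; the greedy pass above only conveys the objective, keeping experts of consecutive layers with high co-activation counts on the same GPU so that fewer tokens cross GPU boundaries during Alltoall.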