12 Mar 2024 | Quzhe Huang1*, Zhenwei An2*, Nan Zhuang2*, Mingxu Tao1, Chen Zhang1, Yang Jin1, Kun Xu3, Kun Xu3, Liwei Chen3, Songfang Huang2, Yansong Feng1
This paper introduces a novel dynamic expert selection framework for Mixture-of-Experts (MoE) models, aiming to improve computational efficiency and model performance by adjusting the number of activated experts based on input complexity. Unlike traditional Top-K routing, which activates a fixed number of experts regardless of input complexity, the proposed dynamic routing method selects experts according to the confidence of the routing distribution: more experts are activated for complex inputs and fewer for simpler ones, making more efficient use of computational resources. Extensive evaluations show that dynamic routing outperforms conventional Top-2 routing by an average of 0.7% while activating fewer than 90% of the parameters. Further analysis reveals that the model dispatches more experts to tasks requiring complex reasoning, such as BBH, confirming its ability to allocate computation dynamically. The findings also highlight variation in the number of experts needed across different transformer layers, suggesting potential for designing heterogeneous MoE frameworks. The code and models are available at https://github.com/ZhenweiAn/Dynamic_MoE.
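To make the idea concrete, below is a minimal sketch of confidence-based dynamic routing, assuming the router keeps the smallest set of experts whose cumulative routing probability reaches a threshold. The function name `dynamic_route`, the `threshold` value, and the tensor shapes are illustrative assumptions for this sketch, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def dynamic_route(router_logits: torch.Tensor, threshold: float = 0.4):
    """Per-token dynamic expert selection (illustrative sketch).

    router_logits: [num_tokens, num_experts] raw router scores.
    Returns a boolean activation mask [num_tokens, num_experts] and the
    routing probabilities.
    """
    # Routing confidence over experts for each token.
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep adding experts (in order of confidence) until the cumulative
    # probability reaches the threshold. Tokens with a peaked (confident)
    # distribution activate few experts; uncertain tokens activate more.
    keep_sorted = (cumulative - sorted_probs) < threshold
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep_sorted)
    return mask, probs

# Usage: the number of activated experts varies per token.
logits = torch.randn(4, 8)            # 4 tokens, 8 experts
mask, probs = dynamic_route(logits, threshold=0.4)
print(mask.sum(dim=-1))               # e.g. tensor([2, 1, 3, 2])
```

Under this formulation, Top-K routing is the special case where every token keeps exactly K experts; the threshold instead lets the activated parameter count track input difficulty, which is what the reported sub-90% activation with a 0.7% average gain relies on.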