Multi-Head Mixture-of-Experts


23 Apr 2024 | Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei
The paper introduces Multi-Head Mixture-of-Experts (MH-MoE), a novel approach to improving the performance and scalability of Sparse Mixture-of-Experts (SMoE) models. SMoE models scale model capacity without significantly increasing training and inference costs, but they suffer from low expert activation and lack fine-grained analysis of the multiple semantic concepts within individual tokens. MH-MoE addresses these issues with a multi-head mechanism that splits each input token into multiple sub-tokens, which are then routed to a diverse set of experts, processed in parallel, and merged back into the original token form. This design yields denser expert activation and a deeper, finer-grained understanding of context. The paper demonstrates the effectiveness of MH-MoE through extensive experiments on three tasks: English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling. The results show that MH-MoE achieves higher expert activation, better scalability, and finer-grained understanding than existing SMoE models. The implementation is straightforward and decoupled from other SMoE frameworks, making it easy to integrate as a drop-in enhancement.
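To make the split-route-merge idea concrete, here is a minimal PyTorch sketch of such a layer. It is not the authors' implementation: the class and parameter names (MHMoE, num_heads, merge_proj, and so on) are illustrative, and it assumes top-1 routing with standard two-layer feed-forward experts, which is a simplification of the routing described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHMoE(nn.Module):
    """Sketch of a Multi-Head Mixture-of-Experts layer: split each token into
    sub-tokens, route every sub-token to an expert, then merge the results."""

    def __init__(self, d_model=512, num_heads=4, num_experts=8, d_ff=1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Multi-head projection applied before splitting tokens into sub-tokens.
        self.multi_head_proj = nn.Linear(d_model, d_model)
        # Merge projection applied after the experts process the sub-tokens.
        self.merge_proj = nn.Linear(d_model, d_model)
        # Router scores each sub-token against every expert.
        self.router = nn.Linear(self.d_head, num_experts)
        # Each expert is a small feed-forward network on the sub-token dimension.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_head, d_ff), nn.GELU(),
                          nn.Linear(d_ff, self.d_head))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, d = x.shape
        # Split each token into num_heads sub-tokens of size d_head.
        sub = self.multi_head_proj(x).reshape(b, s * self.num_heads, self.d_head)
        # Top-1 routing: pick one expert per sub-token (assumption for brevity).
        gate = F.softmax(self.router(sub), dim=-1)   # (b, s*heads, num_experts)
        weight, idx = gate.max(dim=-1)               # (b, s*heads)
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = idx == e                          # sub-tokens assigned to expert e
            if mask.any():
                out[mask] = expert(sub[mask])
        out = out * weight.unsqueeze(-1)             # scale by gate score
        # Merge sub-tokens back into full token vectors.
        return self.merge_proj(out.reshape(b, s, d))
```

Because the split and merge happen entirely inside the layer, the surrounding transformer block sees ordinary token vectors, which is what makes the approach easy to drop into existing SMoE frameworks.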