2024 | Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
Lory is a novel approach for pre-training fully differentiable Mixture-of-Experts (MoE) language models. It introduces two key techniques: (1) causal segment routing, which enables efficient expert merging while preserving the autoregressive nature of language models, and (2) similarity-based data batching, which groups semantically similar documents to encourage expert specialization. Lory models are pre-trained from scratch on 150B tokens with up to 32 experts and 30B parameters. Experimental results show significant gains over parameter-matched dense models in both perplexity (+13.9%) and downstream tasks (+1.5%-11.1%). Despite using segment-level rather than token-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing, and the trained experts capture domain-level specialization without any supervision. Lory is the first fully differentiable MoE approach shown to be effective for language model pre-training at scale, and the work encourages further research on fully differentiable MoE architectures for cultivating specialized experts.
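A minimal sketch of the causal segment routing idea, written in a PyTorch style: routing weights for each segment are computed from the previous segment's representation, the expert FFN parameters are merged into a single FFN via a weighted average, and that merged FFN processes the current segment, so no token is routed using information from its own future. The class, function, and parameter names (`SoftMergedMoEFFN`, `segment_len`, the first-segment handling) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedMoEFFN(nn.Module):
    """Hypothetical sketch of Lory-style causal segment routing.

    Experts are merged in parameter space (a weighted average of their
    FFN weights) rather than selected discretely, keeping the whole
    layer differentiable. Shapes and the first-segment rule are
    simplifications for illustration.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert FFN parameters stored as stacked tensors: (E, d_model, d_ff), (E, d_ff, d_model).
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor, segment_len: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the sequence is split into fixed-length segments.
        segments = x.split(segment_len, dim=1)
        outputs = []
        # Simplification: route the first segment with its own mean representation;
        # the paper treats the first segment as a special case.
        prev_repr = segments[0].mean(dim=1)
        for seg in segments:
            # Routing weights come only from the previous segment, preserving causality.
            gate = F.softmax(self.router(prev_repr), dim=-1)            # (b, E)
            # Merge expert weights into one FFN per sequence in the batch.
            w_in = torch.einsum("be,edf->bdf", gate, self.w_in)         # (b, d_model, d_ff)
            w_out = torch.einsum("be,efd->bfd", gate, self.w_out)       # (b, d_ff, d_model)
            h = F.gelu(torch.einsum("btd,bdf->btf", seg, w_in))
            outputs.append(torch.einsum("btf,bfd->btd", h, w_out))
            prev_repr = seg.mean(dim=1)  # routing input for the next segment
        return torch.cat(outputs, dim=1)
```

Because each segment reuses a single merged FFN, the merge cost is paid once per segment instead of once per token, which is what makes the soft-merging approach practical for pre-training at this scale.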