Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

2024 | Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
Lory is a novel approach for pre-training fully differentiable Mixture-of-Experts (MoE) language models. It introduces two key techniques: (1) causal segment routing, which enables efficient expert merging while preserving the autoregressive nature of language models, and (2) similarity-based data batching, which groups semantically similar documents to encourage expert specialization. Lory models are pre-trained on 150B tokens with up to 32 experts and 30B parameters. Experimental results show significant gains over parameter-matched dense models in both perplexity (+13.9%) and downstream tasks (+1.5%-11.1%). Despite using segment-level routing, Lory achieves competitive performance compared to state-of-the-art MoE models with token-level routing, and the trained experts capture domain-level specialization without any supervision. Lory is the first fully differentiable MoE model shown to be effective for language model pre-training at scale, highlighting the potential of fully differentiable MoE architectures for cultivating specialized experts and motivating further research in this area.
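To make the causal segment routing idea concrete, below is a minimal PyTorch sketch (not the authors' code) of a fully differentiable MoE layer: instead of hard token-level routing, the expert FFN weights are merged into a single FFN by a softmax-weighted average, and the merge weights for each segment are computed from the previous segment's hidden states so that no token depends on future routing information. The segment length, the linear router, and the use of mean-pooled hidden states as the routing feature are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergedExpertFFN(nn.Module):
    """Sketch of a fully differentiable MoE FFN with causal segment routing.

    Expert parameters are soft-merged (weighted average) per segment, so the
    whole layer stays differentiable end-to-end; routing for segment t uses
    the representation of segment t-1 to preserve autoregressive causality.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int, segment_len: int = 256):
        super().__init__()
        self.segment_len = segment_len
        self.router = nn.Linear(d_model, n_experts)
        # Expert weights stored stacked: (n_experts, ...)
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * d_model ** -0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * d_ff ** -0.5)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model); seq_len assumed to be a multiple of segment_len.
        bsz, seq_len, d_model = h.shape
        segments = h.view(bsz, seq_len // self.segment_len, self.segment_len, d_model)
        outputs = []
        for t in range(segments.size(1)):
            # Causal segment routing: route segment t on the mean-pooled hidden
            # states of segment t-1; the first segment routes on itself here
            # (an illustrative simplification).
            routing_input = segments[:, max(t - 1, 0)].mean(dim=1)      # (bsz, d_model)
            gate = F.softmax(self.router(routing_input), dim=-1)        # (bsz, n_experts)
            # Soft-merge expert weights: one merged FFN per example and segment.
            w_in = torch.einsum("be,eij->bij", gate, self.w_in)         # (bsz, d_model, d_ff)
            w_out = torch.einsum("be,eij->bij", gate, self.w_out)       # (bsz, d_ff, d_model)
            x = segments[:, t]                                          # (bsz, segment_len, d_model)
            hidden = F.gelu(torch.einsum("bsd,bdf->bsf", x, w_in))
            outputs.append(torch.einsum("bsf,bfd->bsd", hidden, w_out))
        return torch.cat(outputs, dim=1)
```

Because the routing signal changes only once per segment rather than per token, a single merged FFN can be reused for all tokens in a segment, which is what makes the expert merging efficient; the similarity-based data batching described above complements this by filling each training sequence with semantically related documents so that segment-level routing decisions have a consistent domain to specialize on.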