19 Aug 2024 | Zexuan Zhong†, Mengzhou Xia†, Danqi Chen†, Mike Lewis†
Lory is a novel approach that scales fully differentiable Mixture-of-Experts (MoE) architectures to autoregressive language model pre-training. It introduces two key techniques: (1) a causal segment routing strategy that efficiently merges experts while preserving the autoregressive nature of language models, and (2) a similarity-based data batching method that groups semantically similar documents to encourage expert specialization. Lory models are pre-trained from scratch on 150 billion tokens, with up to 32 experts and 30 billion parameters. Experimental results show significant performance gains over parameter-matched dense models in both perplexity (+13.9%) and various downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. The trained experts capture domain-level specialization without supervision, highlighting the potential of fully differentiable MoE architectures for language model pre-training.
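To make the fully differentiable expert-merging idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: routing weights for each segment are computed from the previous segment's pooled representation and used to take a soft, weighted average of expert FFN parameters, so the whole path stays differentiable while the routing remains causal. All module and variable names, the mean-pooling choice, and the zero-prefix handling of the first segment are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergedExpertFFN(nn.Module):
    """Sketch of soft expert merging with causal segment routing (assumed names)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # One FFN (up/down projection) per expert, stored as stacked tensors
        # so a weighted average over experts is a single einsum.
        self.w_up = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_down = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, n_segments, segment_len, d_model)
        # Routing weights for segment t come from segment t-1 (mean-pooled),
        # so no future tokens influence the merge -- this keeps routing causal.
        pooled = segments.mean(dim=2)                                  # (B, S, d_model)
        prev = torch.cat([torch.zeros_like(pooled[:, :1]),             # first segment: zero prefix
                          pooled[:, :-1]], dim=1)                      # (assumption for this sketch)
        gate = F.softmax(self.router(prev), dim=-1)                    # (B, S, n_experts)

        # Merge expert parameters with the soft routing weights (differentiable),
        # then apply the single merged FFN to every token in the segment.
        up = torch.einsum("bse,eij->bsij", gate, self.w_up)            # (B, S, d_model, d_ff)
        down = torch.einsum("bse,eij->bsij", gate, self.w_down)        # (B, S, d_ff, d_model)
        h = torch.einsum("bstd,bsdf->bstf", segments, up)
        return torch.einsum("bstf,bsfd->bstd", F.gelu(h), down)
```

Because the merged FFN is applied once per segment rather than per token, the routing overhead is amortized over the segment length; a similar sketch of the similarity-based batching step would amount to grouping documents by embedding similarity before packing them into training sequences, which is omitted here for brevity.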