OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

23 Jan 2017 | Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
The paper introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, a novel neural network component designed to significantly increase model capacity while maintaining computational efficiency. The MoE layer consists of a set of feed-forward sub-networks (experts) and a trainable gating network that selects a sparse combination of these experts for each input. This approach addresses the challenges of conditional computation, where parts of the network are active on a per-example basis, and achieves over 1000x improvements in model capacity with minimal computational overhead. The authors apply the MoE layer to language modeling and machine translation tasks, achieving state-of-the-art results on large datasets such as the 1-billion-word language modeling benchmark and the Google News corpus. They also demonstrate the effectiveness of the MoE layer in multilingual machine translation, outperforming both monolingual and multilingual models. The paper provides detailed experimental results and discusses the design considerations and performance challenges of conditional computation.
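Below is a minimal, untrained NumPy sketch of the routing idea described above: a gating matrix scores the experts, only the top-k experts run on each example, and their outputs are mixed using the renormalized gate values. The layer sizes, the plain top-k softmax gate, and the random weights are assumptions chosen for illustration; the paper's noisy top-k gating and load-balancing terms are omitted.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts layer (illustrative only;
# sizes and the simple top-k softmax gate are assumptions, not the paper's exact recipe).
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 32   # assumed input and expert hidden sizes
n_experts, k = 4, 2          # total experts; experts used per example

# Each expert is a small two-layer feed-forward network (weights random here).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_hidden)),
     rng.normal(scale=0.1, size=(d_hidden, d_model)))
    for _ in range(n_experts)
]
# Trainable gating weights (random for this sketch).
W_g = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_layer(x):
    """Route each input row through its top-k experts and mix their outputs."""
    logits = x @ W_g                              # (batch, n_experts) gating scores
    outputs = np.zeros_like(x)
    for i, row in enumerate(x):
        top = np.argsort(logits[i])[-k:]          # indices of the k highest-scoring experts
        gate = np.exp(logits[i][top] - logits[i][top].max())
        gate /= gate.sum()                        # softmax over the selected experts only
        for g, e in zip(gate, top):
            W1, W2 = experts[e]
            outputs[i] += g * (np.maximum(row @ W1, 0.0) @ W2)  # ReLU feed-forward expert
    return outputs

x = rng.normal(size=(8, d_model))
print(moe_layer(x).shape)  # (8, 16): output matches the input shape, but only k of the experts ran per row
```

Because only k of the n experts are evaluated per example, the compute per example stays roughly proportional to k while the parameter count grows with n, which is the mechanism behind the capacity gains the paper reports.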