Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
Abstract: Switch Transformers simplify and improve upon Mixture of Experts (MoE) models, enabling efficient sparse training. The simplified routing algorithm reduces communication and computational costs and allows large sparse models to be trained with lower precision (bfloat16). Models based on T5-Base and T5-Large achieve up to 7x increases in pre-training speed with the same computational resources. The Switch Transformer also improves multilingual learning across all 101 languages studied, and trillion-parameter models achieve a 4x pre-training speedup over the T5-XXL model. The paper presents the design, training, and scaling properties of the Switch Transformer, as well as its performance on downstream tasks and in multilingual learning. The benefits of sparsity hold during both pre-training and fine-tuning, and the sparse models can be distilled into smaller dense models while preserving 30% of the sparse model's quality gain. The paper also discusses how data, model, and expert-parallelism combine to scale the Switch Transformer to trillion-parameter models.
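To make the simplified routing concrete, the sketch below illustrates top-1 ("Switch") routing, where each token is dispatched to a single expert and the expert output is scaled by the router probability. This is a minimal NumPy illustration under toy assumptions (the function name switch_route, the array shapes, and the single-matrix experts are illustrative); it omits the capacity factor and auxiliary load-balancing loss and is not the paper's Mesh-TensorFlow implementation.

```python
import numpy as np

def switch_route(tokens, router_weights, expert_weights):
    """Minimal sketch of top-1 (Switch) routing on a toy NumPy setup.

    tokens:         [num_tokens, d_model] token representations
    router_weights: [d_model, num_experts] router projection
    expert_weights: list of [d_model, d_model] matrices, one per (toy) expert
    """
    # Router produces a probability distribution over experts for each token.
    logits = tokens @ router_weights                      # [num_tokens, num_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Switch routing: each token goes only to its single highest-probability
    # expert, rather than the top-k (k >= 2) experts of earlier MoE layers.
    expert_index = probs.argmax(axis=-1)                  # [num_tokens]
    gate = probs[np.arange(len(tokens)), expert_index]    # [num_tokens]

    # Each selected expert processes its tokens; outputs are scaled by the gate
    # value so gradients flow through the router probabilities.
    outputs = np.zeros_like(tokens)
    for e, W in enumerate(expert_weights):
        mask = expert_index == e
        outputs[mask] = (tokens[mask] @ W) * gate[mask, None]
    return outputs

# Toy usage: 8 tokens, model dimension 16, 4 experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
router = rng.normal(size=(16, 4))
experts = [rng.normal(size=(16, 16)) for _ in range(4)]
print(switch_route(tokens, router, experts).shape)  # (8, 16)
```

Because each token activates only one expert, the per-token compute stays roughly constant as the number of experts (and hence the parameter count) grows, which is the property that lets the architecture scale toward trillion-parameter models.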