GShard is a module designed to scale large neural network models through conditional computation and automatic sharding. It consists of lightweight annotation APIs and an extension to the XLA compiler, allowing users to express parallel computation patterns with minimal changes to existing model code. GShard has been used to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts (MoE) layers beyond 600 billion parameters. The model was trained on 2048 TPU v3 accelerators in 4 days, achieving superior translation quality from 100 languages to English compared to prior art. The paper discusses the challenges of scaling neural networks, including architecture-specific model parallelism support, super-linear scaling of computation cost, infrastructure scalability, and the need for efficient partitioning strategies. GShard addresses these challenges by using conditional computation to keep computation cost sub-linear in model size, by separating the model description from the partitioning implementation, and by providing scalable compiler support. The paper also details how the MoE layer is expressed in linear algebra, the GShard annotation API for parallel execution, and the XLA SPMD partitioner. Finally, the authors demonstrate the effectiveness of GShard through experiments on a massively multilingual machine translation task, showing improved translation quality and reduced training time.
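
To make the annotation-based approach concrete, the sketch below illustrates the general idea of sharding annotations with minimal model changes. It does not use GShard's own API; instead it uses JAX's `jax.sharding` primitives (`Mesh`, `NamedSharding`, `PartitionSpec`) as an analogous mechanism, and the tensor names, shapes, and the `expert_ffn` function are hypothetical. The key point it demonstrates is that the user only annotates how a few tensors are split across devices (here, the expert weights along the expert dimension), while the compiler partitions the computation and inserts communication automatically.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over all available devices (hypothetical setup).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("expert",))

# Annotations: shard expert weights along the expert dimension,
# replicate the input activations on every device.
w_sharding = NamedSharding(mesh, P("expert", None, None))
x_sharding = NamedSharding(mesh, P())

@jax.jit
def expert_ffn(x, w):
    # The model code stays a plain einsum; the compiler partitions it
    # across devices based on the sharding annotations on its inputs.
    return jnp.einsum("gsm,emh->egsh", x, w)

num_experts, model_dim, hidden_dim = len(devices), 8, 32
x = jnp.ones((4, 16, model_dim))                 # (groups, tokens, model)
w = jnp.ones((num_experts, model_dim, hidden_dim))  # (experts, model, hidden)

x = jax.device_put(x, x_sharding)
w = jax.device_put(w, w_sharding)
y = expert_ffn(x, w)  # executes as a partitioned (SPMD) computation
```

The design choice this mirrors is the one the paper emphasizes: the model description (the einsum) is unchanged, and only lightweight sharding annotations communicate the intended parallelism to the compiler.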