Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

13 Mar 2020 | Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro
Megatron-LM is a framework for training very large transformer language models using model parallelism. The paper presents a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. The approach requires no new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be implemented entirely by inserting a few communication operations into native PyTorch.

The authors demonstrate the approach by converging transformer models of up to 8.3 billion parameters on 512 GPUs, sustaining 15.1 PetaFLOPs across the entire application with 76% scaling efficiency relative to a strong single-GPU baseline. They also show that careful placement of layer normalization in BERT-like models is critical to achieving improved performance as the model size grows. Their GPT-2 model sets state-of-the-art results on WikiText103 (perplexity of 10.8 versus the previous best of 15.8) and LAMBADA (accuracy of 66.5% versus 63.2%), and their BERT model sets a state-of-the-art result on RACE (accuracy of 90.9% versus 89.4%).

The paper also discusses the challenges of training large language models, including memory constraints and the need for efficient optimization techniques. The proposed model parallel approach splits the model across multiple accelerators, which not only alleviates memory pressure but also increases the amount of parallelism independently of the microbatch size, and the authors show that scaling the model size improves accuracy for both GPT-2 and BERT. They conclude that the approach is simple to implement, requiring only a few extra all-reduce operations added to the forward and backward passes and no compiler support, which keeps it orthogonal and complementary to pipeline model parallelism. The code is open-sourced at https://github.com/NVIDIA/Megatron-LM.
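To make the "few communication operations" concrete, here is a minimal sketch of how an intra-layer (tensor) parallel MLP block can be expressed in plain PyTorch, following the paper's pattern of a column-parallel first GEMM and a row-parallel second GEMM bracketed by two conjugate all-reduce operators. The class names, the use of the default `torch.distributed` process group, and the GELU activation are illustrative assumptions; this is not the repository's actual implementation.

```python
# Sketch of Megatron-style intra-layer (tensor) model parallelism for the transformer MLP.
import torch
import torch.nn as nn
import torch.distributed as dist


class _CopyToModelParallel(torch.autograd.Function):
    """Operator 'f': identity in the forward pass, all-reduce of the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)  # sum input gradients across model-parallel ranks
        return grad_output


class _ReduceFromModelParallel(torch.autograd.Function):
    """Operator 'g', the conjugate of 'f': all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        x = x.clone()                 # avoid in-place modification of the autograd input
        dist.all_reduce(x)            # sum the partial outputs from all ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ParallelMLP(nn.Module):
    """MLP whose first weight matrix is split by columns and second by rows,
    so each of the forward and backward passes needs only one all-reduce."""
    def __init__(self, hidden_size, ffn_size, world_size):
        super().__init__()
        assert ffn_size % world_size == 0
        per_rank = ffn_size // world_size
        self.fc1 = nn.Linear(hidden_size, per_rank)   # column-parallel shard of the first GEMM
        self.fc2 = nn.Linear(per_rank, hidden_size)   # row-parallel shard of the second GEMM
        self.act = nn.GELU()

    def forward(self, x):
        x = _CopyToModelParallel.apply(x)             # f: identity forward, all-reduce backward
        y = self.act(self.fc1(x))                     # GELU applied independently to each shard
        z = self.fc2(y)                               # each rank holds a partial sum of the output
        return _ReduceFromModelParallel.apply(z)      # g: all-reduce forward, identity backward
```

A single all-reduce in the forward pass (g) and a single all-reduce in the backward pass (f) per block is what keeps the communication cost low; the paper parallelizes the self-attention block analogously by splitting it across attention heads.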
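On the layer normalization point, the claim is architectural: for BERT-like models, where the layer norm sits relative to the residual connection determines whether larger models keep improving. The sketch below contrasts the two orderings for a generic sublayer (self-attention or the feed-forward network); the module names and wiring are illustrative assumptions rather than the paper's published code.

```python
# Sketch contrasting the original (post-LN) BERT sublayer ordering with the
# rearranged (pre-LN) ordering used for larger BERT-style models.
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Original BERT ordering: residual add first, then layer norm."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Rearranged ordering: layer norm on the sublayer input, residual connection outside."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

In the paper, the rearranged ordering is what allows BERT-style models beyond the BERT-Large scale to keep improving in downstream accuracy rather than degrading.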