Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism


13 Mar 2020 | Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro
The paper "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" presents techniques for training large transformer models, specifically focusing on a simple and efficient intra-layer model parallel approach. This approach enables the training of transformer models with billions of parameters using 512 GPUs, achieving 15.1 PetaFLOPs sustained performance. The method does not require significant changes to the existing PyTorch framework and is orthogonal to pipeline model parallelism. The authors demonstrate the effectiveness of their approach by training an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT, achieving state-of-the-art results on various downstream tasks such as WikiText103, LAMBADA, and RACE. The paper also highlights the importance of careful placement of layer normalization in BERT-like models to improve performance as the model size increases.The paper "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" presents techniques for training large transformer models, specifically focusing on a simple and efficient intra-layer model parallel approach. This approach enables the training of transformer models with billions of parameters using 512 GPUs, achieving 15.1 PetaFLOPs sustained performance. The method does not require significant changes to the existing PyTorch framework and is orthogonal to pipeline model parallelism. The authors demonstrate the effectiveness of their approach by training an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT, achieving state-of-the-art results on various downstream tasks such as WikiText103, LAMBADA, and RACE. The paper also highlights the importance of careful placement of layer normalization in BERT-like models to improve performance as the model size increases.