MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

23 Feb 2024 | Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
MegaScale is a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. It addresses the challenges of training efficiency and stability at this scale with a full-stack approach that co-designs algorithmic and system components, including model block and optimizer design, computation and communication overlapping, operator optimization, the data pipeline, and network performance tuning. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, a 1.34× improvement over Megatron-LM. The system attains this efficiency and stability by applying two systems principles: algorithm-system co-design and in-depth observability.

MegaScale is a specialized system tailored for LLM training. It incorporates effective optimization techniques such as the parallel transformer block, sliding window attention, and the LAMB optimizer, and it leverages mixed parallelism strategies that combine data parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism. The system also applies prefetching and tree-based loading to optimize the data pipeline, non-blocking asynchronous operations to eliminate global barriers, and a custom network topology that reduces ECMP hash conflicts, together with tuned retransmit timeout parameters, for high network performance.

Stability problems in large-scale systems are notoriously hard to diagnose and fix. MegaScale uses in-depth observability to build a set of diagnosis tools that monitor system components and events deep in the stack, identify root causes, and derive effective techniques for fault tolerance and straggler mitigation. The system also includes a robust training framework that automates fault localization and recovery, and a suite of diagnostic tests to identify nodes causing disruptions.

MegaScale is deployed in ByteDance's datacenters to train LLMs for its products, achieving high training efficiency and stability. It has been used to train a proprietary model with hundreds of billions of parameters on multi-trillion tokens over several weeks, with the loss continuing to converge and the training process being repaired and recovered more than 100 times in the presence of failures. The team is working on open-sourcing, on GitHub, components that can benefit the community.
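To give a rough sense of what the reported MFU figure means, MFU can be estimated from model size, observed token throughput, GPU count, and per-GPU peak FLOPS using the common "6 FLOPs per parameter per token" approximation for training compute. The sketch below is purely illustrative and is not the paper's accounting; the assumed A100-class bf16 peak of 312 TFLOPS and the derived throughput are assumptions, not figures from the paper.

```python
# Illustrative MFU estimate using the common 6 * params * tokens approximation
# for training FLOPs (forward + backward). Peak-FLOPS and throughput values
# below are assumptions for illustration only.

def model_flops_utilization(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Estimated MFU = achieved model FLOPs per second / aggregate peak FLOPs."""
    achieved_flops_per_sec = 6 * params * tokens_per_sec   # ~6 FLOPs per param per token
    peak_flops_total = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_total

params = 175e9        # 175B-parameter model
num_gpus = 12288
peak = 312e12         # assumed A100-class bf16 peak per GPU

# Throughput that would correspond to ~55.2% MFU under these assumptions:
tokens_per_sec = 0.552 * num_gpus * peak / (6 * params)
print(f"~{tokens_per_sec:,.0f} tokens/s -> MFU = "
      f"{model_flops_utilization(params, tokens_per_sec, num_gpus, peak):.1%}")
```

Under the same accounting, the reported 1.34× improvement over Megatron-LM implies a baseline of roughly 55.2% / 1.34 ≈ 41% MFU.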
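The data-pipeline optimizations mentioned above rely on hiding data-loading latency behind computation. The sketch below is not MegaScale's implementation; it is a minimal, generic illustration of asynchronous prefetching using a background thread and a bounded queue, with hypothetical names throughout.

```python
# Minimal sketch of asynchronous data prefetching: a background thread keeps a
# bounded queue of ready batches so the training loop rarely waits on I/O.
# Generic illustration only, not MegaScale's actual data pipeline.
import queue
import threading

class Prefetcher:
    def __init__(self, batch_iter, depth=4):
        self._iter = batch_iter
        self._queue = queue.Queue(maxsize=depth)   # bounded to cap memory use
        self._done = object()                      # sentinel marking exhaustion
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        for batch in self._iter:
            self._queue.put(batch)                 # blocks when the queue is full
        self._queue.put(self._done)

    def __iter__(self):
        while True:
            batch = self._queue.get()              # usually returns immediately
            if batch is self._done:
                return
            yield batch

# Usage (data_loader and train_step are placeholders):
# for batch in Prefetcher(data_loader, depth=4):
#     train_step(batch)
```

The bounded queue depth trades a small amount of host memory for keeping the GPUs fed; deeper pipelines hide more I/O jitter but increase memory pressure.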