MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

23 Feb 2024 | Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiya Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
The paper presents the design, implementation, and engineering experience of MegaScale, a production system for training large language models (LLMs) on over 10,000 GPUs. The authors address the challenges of training efficiency and stability at this scale, focusing on algorithmic and system component co-design, computation and communication overlapping, operator optimization, data pipeline optimization, and network performance tuning. They achieve 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving MFU by 1.34× compared to Megatron-LM. The paper also details their approach to maintaining stability, including the development of diagnosis tools for monitoring system components and events, and techniques for fault tolerance and straggler mitigation. MegaScale has been successfully deployed in datacenters to train LLMs for various products, demonstrating its effectiveness in both efficiency and stability.
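
For context on the headline metric, MFU is the ratio of the model FLOPs a training run actually sustains to the theoretical peak FLOPs of the hardware. The minimal sketch below illustrates that calculation; it assumes the common convention of roughly 6 FLOPs per parameter per trained token, and the throughput and per-GPU peak values in the example are hypothetical, not figures reported in the paper.

# Minimal sketch of a Model FLOPs Utilization (MFU) estimate.
# Assumes ~6 FLOPs per parameter per trained token (a common convention)
# and ignores activation recomputation; not the paper's exact accounting.
def estimate_mfu(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    achieved_flops = 6 * num_params * tokens_per_second  # sustained model FLOPs/s
    peak_flops = num_gpus * peak_flops_per_gpu            # aggregate cluster peak FLOPs/s
    return achieved_flops / peak_flops

# Hypothetical example: a 175B-parameter model on 12,288 GPUs with an assumed
# throughput of 2.0e6 tokens/s and 3.12e14 peak FLOPs/s per GPU (A100 BF16 peak).
print(f"MFU ~ {estimate_mfu(175e9, 2.0e6, 12288, 3.12e14):.1%}")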