The paper presents the design, implementation, and engineering experience of MegaScale, a production system for training large language models (LLMs) on over 10,000 GPUs. The authors address the challenges of training efficiency and stability at this scale, focusing on algorithm and system component co-design, computation and communication overlapping, operator optimization, data pipeline optimization, and network performance tuning. They achieve 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM on 12,288 GPUs, improving MFU by 1.34× compared to Megatron-LM. The paper also details their approach to maintaining stability, including the development of diagnosis tools for monitoring system components and events, and techniques for fault tolerance and straggler mitigation. MegaScale has been successfully deployed in datacenters to train LLMs for various products, demonstrating its effectiveness in both efficiency and stability.
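For context, MFU measures the ratio of achieved training throughput to the aggregate peak throughput of the hardware. Below is a minimal sketch of the standard MFU calculation using the common ~6N FLOPs-per-token approximation for a dense transformer with N parameters; the throughput and peak-FLOPS figures are illustrative assumptions, not values reported in the paper:

```python
def model_flops_utilization(tokens_per_sec: float,
                            num_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """Estimate MFU via the ~6N FLOPs-per-token approximation
    (forward + backward pass) for a dense transformer."""
    achieved_flops = tokens_per_sec * 6 * num_params
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Illustrative numbers only (assumed, not taken from the paper):
# a 175B-parameter model on 12,288 GPUs, each with ~312 TFLOPS
# peak BF16 throughput (e.g., an A100-class accelerator).
mfu = model_flops_utilization(
    tokens_per_sec=2.0e6,          # assumed aggregate token throughput
    num_params=175e9,
    num_gpus=12288,
    peak_flops_per_gpu=312e12,
)
print(f"MFU: {mfu:.1%}")          # ~54.8% under these assumptions
```

Under this formulation, raising MFU means either increasing sustained token throughput (e.g., by overlapping communication with computation) or avoiding wasted GPU time (e.g., by mitigating stragglers), which is exactly the space of optimizations the paper explores.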