Characterization of Large Language Model Development in the Datacenter


3 Apr 2024 | Qinghao Hu*, Zhisheng Ye*, Zerui Wang*, Guoteng Wang, Meng Zhang*, Qiaoling Chen*, Peng Sun*, Dahua Lin*, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
This paper presents an in-depth characterization study of large language model (LLM) development in a GPU datacenter, focusing on the challenges and opportunities in utilizing large-scale cluster resources efficiently. The study is based on a six-month workload trace from the Acme datacenter, which houses two LLM clusters, Seren and Kalos, equipped with a total of 4,704 A100 GPUs.

Key findings include:

1. **Shorter Job Durations and Unfair Queuing Delays**: LLM workloads exhibit significantly shorter average job durations than previous task-specific deep learning (DL) workloads. Evaluation jobs, despite being short-term and small-scale, suffer the longest queuing delays because resources are reserved for pretraining jobs.
2. **Imbalanced Resource Usage**: Pretraining jobs consume a disproportionate share of GPU resources, while evaluation jobs use minimal resources. CPU, host memory, and network resources are frequently underutilized, whereas GPU memory and GPU utilization show high median values.
3. **Long GPU Idle Time in Evaluation Workloads**: Evaluation jobs spend a large portion of their time on model loading and data preprocessing, leading to long queuing delays and underutilized GPUs.
4. **Frequent Job Failures**: Failures occur primarily at the beginning of LLM workloads, with infrastructure failures being the most severe and frequent, impeding training progress.

To address these challenges, the authors introduce two system efforts:

1. **Fault-Tolerant Pretraining**: Enhances fault tolerance through asynchronous checkpointing, LLM-assisted failure diagnosis, and automatic recovery (a minimal checkpointing sketch follows after this summary).
2. **Decoupled Scheduling for Evaluation**: Provides timely performance feedback via trial decomposition and scheduling optimization, reducing evaluation makespan (see the scheduling sketch below).

The paper also discusses the background of LLM development, the characteristics of LLM workloads, and the infrastructure and software stack used in the datacenter. The findings and system designs aim to improve the efficiency and robustness of LLM development in large-scale GPU clusters.
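Below is a minimal sketch of what asynchronous checkpointing can look like in a PyTorch-style training loop: parameters are briefly copied to host memory, and the slow disk write happens in a background thread while training continues. The helper names (`snapshot_to_cpu`, `async_save`), the checkpoint interval, and the file paths are illustrative assumptions, not the paper's actual implementation.

```python
import threading
import torch


def snapshot_to_cpu(model: torch.nn.Module) -> dict:
    # Copy parameters to host memory so training can resume immediately;
    # the slow disk write is moved off the critical path.
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}


def async_save(state: dict, path: str) -> threading.Thread:
    # Persist the CPU snapshot in a background thread while training continues.
    t = threading.Thread(target=torch.save, args=(state, path), daemon=True)
    t.start()
    return t


if __name__ == "__main__":
    model = torch.nn.Linear(16, 16)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    pending = None
    for step in range(100):
        x = torch.randn(8, 16)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 25 == 0:
            if pending is not None:
                pending.join()  # ensure the previous checkpoint finished writing
            state = snapshot_to_cpu(model)  # brief pause: copy to host memory
            pending = async_save(state, f"ckpt_step{step}.pt")  # overlaps with training
    if pending is not None:
        pending.join()
```

The key design point is that only the host-memory copy blocks the training loop; serialization and I/O overlap with subsequent training steps, shortening the stall that frequent checkpointing would otherwise impose.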
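The core idea of decoupled evaluation scheduling, decomposing one large evaluation job into many small independent trials that can be packed onto whichever GPUs are free, can be illustrated with a simple makespan comparison. The per-dataset durations, GPU count, and the longest-processing-time heuristic below are assumptions for illustration; the paper's actual scheduling policy is not reproduced here.

```python
import heapq


def makespan(task_durations, num_gpus):
    # Greedy longest-processing-time assignment onto the earliest-free GPU.
    finish_times = [0.0] * num_gpus
    heapq.heapify(finish_times)
    for d in sorted(task_durations, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)


if __name__ == "__main__":
    # One evaluation job = one model scored on many benchmark datasets.
    dataset_minutes = [30, 25, 20, 15, 10, 10, 5, 5]  # hypothetical per-dataset cost
    gpus = 4

    # Monolithic job: datasets evaluated one after another on a reserved GPU set.
    monolithic = sum(dataset_minutes)

    # Decomposed: each dataset becomes an independent trial scheduled onto
    # whichever GPU frees up first.
    decomposed = makespan(dataset_minutes, gpus)

    print(f"monolithic evaluation: {monolithic} min")
    print(f"decomposed trials on {gpus} GPUs: {decomposed:.0f} min")
```

Under these assumed numbers, decomposition cuts the evaluation makespan from 120 minutes to 30 minutes by letting independent trials fill idle GPUs concurrently, which is the intuition behind the reported reduction in evaluation makespan.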