Characterization of Large Language Model Development in the Datacenter

3 Apr 2024 | Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
This paper presents an in-depth characterization study of a six-month LLM development workload trace collected from the GPU datacenter Acme. The study investigates discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explores resource utilization patterns, and identifies the impact of various job failures. Our analysis summarizes the hurdles encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery, and (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.

LLMs deliver impressive performance across a range of transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial: the process is riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. LLM development is closely intertwined with GPU cluster support in many aspects, so a thorough analysis of cluster workloads is essential for understanding the challenges and uncovering opportunities in designing systems tailored for LLMs. Moreover, many conclusions and implications from existing DL workload analyses do not apply to LLM development because of the divergent characteristics and requirements of LLMs.

The paper's key findings include:

(1) Shorter job duration and unfair queuing delay. In contrast to the common stereotype that LLM workloads are long-running, the workloads in our datacenter exhibit 2.7~12.8× shorter average job duration than the DL workloads in previous traces, which can be attributed to the presence of numerous short-term tasks such as evaluation. In terms of queuing delay, our findings also diverge from previous DL traces, where larger-scale jobs experience longer wait times: we observe that evaluation jobs, despite being short-term and small-scale, have the longest queuing delay. This discrepancy stems from reserving the majority of resources for pretraining jobs to minimize their queuing delays, while evaluation jobs are scheduled with lower priority on the limited spare resources.

(2) Imbalanced resource usage. The imbalance manifests in two aspects. First, in terms of workload distribution, pretraining jobs account for only 3.2% of the total job count yet consume 94.0% of the compute resources (i.e., GPU time) in Kalos, whereas evaluation jobs, despite constituting 92.9% of all jobs, utilize a meager 0.8% of resources. Second, in terms of infrastructure utilization, associated resources, including CPU, host memory, and network, are frequently underutilized, while the GPU, as the primary resource, is highly utilized: both GPU memory and GPU utilization exhibit substantially higher median values, at 75% (60 GB) and 99%, respectively.
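As a rough illustration of how such workload statistics can be derived, the sketch below computes per-category job counts, GPU-time shares, average durations, and queuing delays from a job trace. The trace schema (`job_id`, `category`, `submit_time`, `start_time`, `end_time`, `num_gpus`) is assumed purely for illustration and is not the actual schema of the Acme trace.

```python
import pandas as pd

# Hypothetical trace schema, assumed for illustration only:
#   job_id, category, submit_time, start_time, end_time, num_gpus
trace = pd.read_csv("job_trace.csv",
                    parse_dates=["submit_time", "start_time", "end_time"])

# Per-job metrics: runtime, queuing delay, and consumed GPU time.
trace["duration_h"] = (trace["end_time"] - trace["start_time"]).dt.total_seconds() / 3600
trace["queue_h"] = (trace["start_time"] - trace["submit_time"]).dt.total_seconds() / 3600
trace["gpu_hours"] = trace["duration_h"] * trace["num_gpus"]

# Aggregate by workload category (e.g., pretrain, evaluation, ...).
summary = trace.groupby("category").agg(
    job_count=("job_id", "count"),
    avg_duration_h=("duration_h", "mean"),
    avg_queue_h=("queue_h", "mean"),
    gpu_hours=("gpu_hours", "sum"),
)
summary["job_share_pct"] = 100 * summary["job_count"] / summary["job_count"].sum()
summary["gpu_time_share_pct"] = 100 * summary["gpu_hours"] / summary["gpu_hours"].sum()

# In a trace like Kalos, the pretraining row would show a tiny job share but a
# dominant GPU-time share, while evaluation shows the reverse.
print(summary.sort_values("gpu_time_share_pct", ascending=False))
```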
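The summary describes fault-tolerant pretraining only at a high level (LLM-involved failure diagnosis plus automatic recovery). The sketch below illustrates the general pattern under assumed interfaces: a supervising loop resumes training from the latest checkpoint after a crash and triages the failure from logs before deciding whether to restart. The `pretrain.py` script, checkpoint layout, and keyword-based `diagnose` rules are hypothetical placeholders; in the paper's system the diagnosis step involves an LLM rather than fixed rules.

```python
import subprocess
import time
from pathlib import Path

CKPT_DIR = Path("/checkpoints/run-001")   # hypothetical checkpoint directory

def latest_checkpoint() -> str | None:
    """Return the most recent checkpoint path, if any exists."""
    ckpts = sorted(CKPT_DIR.glob("step_*"), key=lambda p: int(p.name.split("_")[1]))
    return str(ckpts[-1]) if ckpts else None

def diagnose(log_text: str) -> str:
    """Toy keyword-based triage; the paper's system delegates this step to an LLM."""
    if "ECC error" in log_text or "NCCL" in log_text:
        return "infrastructure"   # likely hardware/network fault: safe to restart
    if "loss is NaN" in log_text:
        return "training"         # likely a code/data problem: needs human attention
    return "unknown"

def run_until_done(max_restarts: int = 10) -> None:
    """Keep relaunching pretraining from the latest checkpoint until it finishes."""
    for _ in range(max_restarts):
        ckpt = latest_checkpoint()
        cmd = ["python", "pretrain.py"] + (["--resume", ckpt] if ckpt else [])
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return                # training completed
        if diagnose(proc.stdout + proc.stderr) == "training":
            raise RuntimeError("non-recoverable training failure; stopping")
        time.sleep(60)            # back off, then resume from the last checkpoint
    raise RuntimeError("exceeded restart budget")
```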
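Likewise, decoupled scheduling for evaluation is described only in terms of trial decomposition and scheduling optimization. The following sketch shows one way to read that idea, under assumed interfaces: a monolithic evaluation job is split into independent per-dataset trials, which are packed greedily onto whatever spare GPUs happen to be free, so short trials need not wait behind a single large allocation. The `Trial`, `decompose`, and `schedule` names are hypothetical and not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Trial:
    """One decomposed unit of an evaluation job (here: one model on one dataset)."""
    model: str
    dataset: str
    gpus_needed: int = 1

def decompose(model: str, datasets: list[str]) -> deque[Trial]:
    """Split a monolithic evaluation job into independent per-dataset trials."""
    return deque(Trial(model, d) for d in datasets)

def schedule(pending: deque[Trial], free_gpus: int) -> list[Trial]:
    """Greedily launch queued trials onto currently idle GPUs; leftovers stay queued."""
    launched = []
    while pending and free_gpus >= pending[0].gpus_needed:
        trial = pending.popleft()
        free_gpus -= trial.gpus_needed
        launched.append(trial)    # a real scheduler would submit the trial here
    return launched

# Usage: decompose one evaluation of "model-7b" over several benchmarks, then pack
# as many trials as the four currently idle GPUs allow.
queue = decompose("model-7b", ["mmlu", "gsm8k", "humaneval", "triviaqa"])
running = schedule(queue, free_gpus=4)
print([t.dataset for t in running], "launched;", len(queue), "still queued")
```

Because each trial is small and independent, it can slot into transient gaps in a pretraining-dominated cluster, which is the intuition behind delivering timely evaluation feedback despite evaluation's low scheduling priority.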