January 2024 | ZHISHENG YE, WEI GAO, QINGHAO HU, PENG SUN, XIAOLIN WANG, YINGWEI LUO, TIANWEI ZHANG, YONGGANG WEN
This article provides a comprehensive survey of existing research efforts on deep learning (DL) workload scheduling in GPU datacenters, covering both training and inference workloads. The authors identify the unique characteristics and challenges of DL workloads, such as inherent heterogeneity, placement sensitivity, iterative processes, feedback-driven exploration, exclusive allocation versus GPU sharing, and gang scheduling versus elastic training. They categorize existing schedulers by scheduling objective (efficiency, cost, fairness, and deadline guarantee) and by resource utilization manner, and discuss the impact of heterogeneous resources, GPU sharing, and elastic training on resource utilization. The authors highlight the need for more efficient and fair scheduling mechanisms to address the challenges posed by DL workloads, and they suggest future research directions such as optimizing deadline guarantees and cost efficiency and managing emerging hardware resources. The article aims to provide a systematic overview of the current state of DL workload scheduling and to guide future research and practical applications.
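The gang-scheduling-versus-elastic-training contrast mentioned above can be illustrated with a minimal sketch. This is not code from any scheduler in the survey; the function names and the simple free-GPU model are illustrative assumptions. A gang-scheduled job starts only when all of its requested GPUs are free (all-or-nothing), while an elastic job may start with any GPU count within a configured range and scale later:

```python
# Illustrative sketch only: contrasts two allocation disciplines for DL jobs.
# Function names and the free-GPU counting model are hypothetical.

def gang_schedule(free_gpus: int, requested: int) -> int:
    """Gang scheduling: launch all workers together or none at all."""
    return requested if free_gpus >= requested else 0


def elastic_schedule(free_gpus: int, min_gpus: int, max_gpus: int) -> int:
    """Elastic training: launch with any GPU count in [min_gpus, max_gpus]."""
    if free_gpus < min_gpus:
        return 0  # still blocked below the job's minimum
    return min(free_gpus, max_gpus)


# With 3 free GPUs, a 4-GPU gang job must queue, while an elastic job
# with a (1, 4) range starts immediately on 3 GPUs and can grow later.
print(gang_schedule(3, 4))        # 0 -> job waits
print(elastic_schedule(3, 1, 4))  # 3 -> job starts on 3 GPUs
```

The sketch makes the resource-utilization trade-off concrete: gang scheduling can leave GPUs idle while a large job waits for its full allocation, whereas elasticity fills those gaps at the cost of runtime reconfiguration.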