January 2024 | ZHISHENG YE, WEI GAO, QINGHAO HU, PENG SUN, XIAOLIN WANG, YINGWEI LUO, TIANWEI ZHANG, YONGGANG WEN
This survey presents a comprehensive overview of deep learning (DL) workload scheduling in GPU datacenters, covering both training and inference workloads. DL has become a key technology across many fields, and it demands substantial computational resources, particularly GPUs. GPU datacenters are therefore essential infrastructure for DL development, yet traditional schedulers designed for big data or high-performance computing (HPC) workloads are poorly suited to DL. Recent research has produced specialized schedulers that optimize DL workloads in GPU datacenters. This survey analyzes existing scheduling approaches for DL training and inference, categorizing them by scheduling objective and resource consumption characteristics. It discusses challenges unique to DL workloads, such as heterogeneous resource management, placement sensitivity, iterative execution, and feedback-driven exploration, and emphasizes the role of efficient scheduling in reducing operational costs, improving resource utilization, and enhancing user experience. It also identifies limitations of current scheduling methods and suggests areas for improvement, including better handling of heterogeneous resources and more effective scheduling strategies for different DL workloads. The survey concludes with future research directions, spanning emerging DL workloads, advanced scheduling decision-making, and underlying hardware resources.