April 22–25, 2024 | Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin
The paper introduces *HAP*, an automated system designed to optimize the training of large deep neural networks (DNNs) on heterogeneous GPU clusters using Single-Program-Multiple-Data (SPMD) parallelism. *HAP* jointly optimizes the tensor sharding strategy, the sharding ratios across heterogeneous devices, and the communication methods for tensor exchanges. It formulates model partitioning as a program synthesis problem, generating a distributed program from a single-device instruction set. The system uses an A*-based search algorithm to explore the solution space and derives optimal sharding ratios through linear programming. *HAP* also integrates communication optimization techniques, such as padded All-Gather, grouped Broadcast, and sufficient factor broadcasting, to enhance performance on heterogeneous clusters. Extensive experiments demonstrate that *HAP* achieves up to 2.41x speed-up on heterogeneous clusters, outperforming existing systems while maintaining competitive performance on homogeneous clusters.
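
To make the sharding-ratio step concrete, here is a minimal sketch of how ratios for heterogeneous devices can be derived with a linear program: minimize the makespan `t` subject to each device's compute time `r_i * cost / speed_i <= t` and `sum(r_i) = 1`. This illustrates the general technique only, not HAP's actual formulation (which also accounts for communication costs); the device speeds and compute cost below are made-up numbers.

```python
# Sketch: choosing per-device sharding ratios via linear programming.
# Not HAP's formulation -- a simplified min-makespan LP for illustration.
from scipy.optimize import linprog

speeds = [1.0, 1.0, 2.5, 2.5]  # hypothetical relative speeds of 4 devices
cost = 100.0                   # hypothetical total compute cost of the op
n = len(speeds)

# Variables x = [r_1, ..., r_n, t]; objective: minimize t (the makespan).
c = [0.0] * n + [1.0]

# Per-device constraint: r_i * (cost / speed_i) - t <= 0.
A_ub = []
for i in range(n):
    row = [0.0] * (n + 1)
    row[i] = cost / speeds[i]
    row[n] = -1.0
    A_ub.append(row)
b_ub = [0.0] * n

# The sharding ratios must sum to 1.
A_eq = [[1.0] * n + [0.0]]
b_eq = [1.0]

bounds = [(0.0, 1.0)] * n + [(0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("sharding ratios:", res.x[:n], "makespan:", res.x[n])
```

As expected, this simplified LP assigns each device a ratio proportional to its speed, so faster GPUs receive larger shards and all devices finish at roughly the same time.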
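
The padded All-Gather idea can likewise be simulated in a single process: when shards are uneven (a consequence of heterogeneous sharding ratios), each shard is padded to the largest shard size so a fixed-size collective applies, and the padding is stripped out afterwards. This is a sketch under that assumption, not HAP's implementation; `padded_all_gather` is a hypothetical helper that stands in for a real collective by operating on NumPy arrays.

```python
import numpy as np

def padded_all_gather(shards):
    """Simulate padded All-Gather over uneven 1-D shards (illustrative only)."""
    # Pad every shard to the largest shard length so a standard
    # fixed-size all-gather could be used.
    max_len = max(s.shape[0] for s in shards)
    padded = [np.pad(s, (0, max_len - s.shape[0])) for s in shards]
    gathered = np.concatenate(padded)  # what the collective would return
    # Strip the padding back out using the known shard lengths.
    out, offset = [], 0
    for s in shards:
        out.append(gathered[offset:offset + s.shape[0]])
        offset += max_len
    return np.concatenate(out)

shards = [np.arange(3), np.arange(5), np.arange(8)]  # uneven shards
print(padded_all_gather(shards))
```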