April 22-25, 2024 | Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin
HAP is an automated system for SPMD DNN training on heterogeneous GPU clusters. It jointly optimizes tensor sharding strategies, sharding ratios across heterogeneous devices, and communication methods for efficient distributed training with SPMD parallelism. HAP takes a novel approach by formulating model partitioning as a program synthesis problem: it generates, from scratch, a distributed program on a distributed instruction set that semantically resembles the program written for a single device, and it systematically explores the solution space with an A*-based search algorithm. HAP derives optimal tensor sharding ratios by formulating the problem as a linear program. Additionally, HAP explores tensor communication optimization in a heterogeneous cluster and integrates it into the program synthesis process, automatically choosing optimal collective communication primitives and applying the sufficient factor broadcasting technique. Extensive experiments on representative workloads demonstrate that HAP achieves up to 2.41x speed-up on heterogeneous clusters.
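The summary above does not spell out HAP's cost model, but the sharding-ratio step can be pictured with a minimal linear-programming sketch: given relative device speeds, choose the fraction of work per device so that the slowest device finishes as early as possible. The sketch below uses scipy.optimize.linprog and deliberately ignores memory limits and communication cost, which a full formulation like HAP's would also account for; it is an illustration of the technique, not HAP's actual model.

```python
# Illustrative LP for heterogeneous sharding ratios: minimize the slowest
# device's compute time r_i / speed_i, subject to the ratios summing to 1.
# This is a simplified stand-in for HAP's cost model (no memory or comm terms).
import numpy as np
from scipy.optimize import linprog

def optimal_ratios(speeds):
    """speeds[i]: relative compute throughput of device i (arbitrary units)."""
    n = len(speeds)
    # Variables: x = [r_0, ..., r_{n-1}, T]; minimize the makespan T.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # r_i - speeds[i] * T <= 0   <=>   r_i / speeds[i] <= T
    A_ub = np.zeros((n, n + 1))
    for i, s in enumerate(speeds):
        A_ub[i, i] = 1.0
        A_ub[i, -1] = -s
    b_ub = np.zeros(n)
    # Ratios sum to 1.
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(0.0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Example: two GPU types with roughly 2:1 throughput get ~2/3 and ~1/3 of the work.
print(optimal_ratios([2.0, 1.0]))  # ~[0.667, 0.333]
```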
SPMD parallelism generalizes data parallelism and intra-layer model parallelism by sharding each tensor along any of its dimensions and partitioning input data across the devices. It has proven effective for training various state-of-the-art models, but has so far been exploited on homogeneous clusters. Enabling efficient SPMD training on heterogeneous resources allows better utilization of the resources at hand and substantially lowers the cost of large-model training. Three key decisions are involved in applying SPMD parallelism in heterogeneous clusters: (i) the sharding strategy, i.e., which dimension to partition (the sharding dimension) for each tensor; (ii) the sharding ratios across the devices, i.e., the different partition sizes assigned to heterogeneous devices according to their computation and memory capacities, to maximize device utilization; and (iii) the communication methods, i.e., the implementation of each collective communication operation for each tensor, chosen to suit different tensor sizes and different interconnect bandwidths across devices. The three decisions are closely inter-related.
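To make decisions (i) and (ii) concrete, the following PyTorch sketch splits a weight tensor along a chosen dimension into uneven shards proportional to assumed relative device speeds. HAP chooses both the sharding dimension and the ratios automatically; the hand-picked dimension and 2:1 ratio here are purely illustrative.

```python
# Illustrative only: uneven sharding of one tensor along one dimension,
# with shard sizes proportional to per-device speed (decisions (i) and (ii)).
import torch

def uneven_shard(tensor, ratios, dim=0):
    """Split `tensor` along `dim` into chunks sized according to `ratios`."""
    length = tensor.size(dim)
    sizes = [int(length * r) for r in ratios]
    sizes[-1] = length - sum(sizes[:-1])  # absorb rounding into the last shard
    return torch.split(tensor, sizes, dim=dim)

weight = torch.randn(1024, 4096)
# e.g. a fast and a slow GPU with an assumed 2:1 compute ratio
shards = uneven_shard(weight, ratios=[2 / 3, 1 / 3], dim=0)
print([s.shape for s in shards])  # [torch.Size([682, 4096]), torch.Size([342, 4096])]
```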
HAP is an SPMD DNN training system for heterogeneous clusters that automatically decides optimal tensor sharding dimensions/ratios and communication methods for expedited training and optimized resource utilization. The authors make the following contributions in designing HAP: (1) an iterative optimization process that alternately optimizes the SPMD sharding strategy and the sharding ratios, fixing one while optimizing the other; (2) a novel formulation of SPMD model sharding as a program synthesis problem, constructing a distributed program on a distributed instruction set that emulates a given tensor program written for a single-device instruction set; (3) a linear cost model and a formulation of sharding-ratio optimization as a linear programming problem, solved optimally with off-the-shelf solvers; (4) two communication optimization techniques integrated into the program synthesis, so that communication on heterogeneous clusters is optimized jointly with SPMD sharding; and (5) an implementation of HAP on PyTorch and evaluation on a 64-GPU heterogeneous cluster on a public cloud. Experiments with representative workloads demonstrate that HAP achieves up to 2.41x speed-up on heterogeneous clusters.
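Contribution (1) can be pictured as the hedged Python sketch below. The functions `synthesize_program` and `solve_ratios_lp` are hypothetical stand-ins for HAP's A*-based program synthesis and its linear-programming solver, and the loop structure is an assumption about how the two stages alternate rather than HAP's exact implementation.

```python
# A sketch of alternating optimization: fix the ratios and search for the best
# distributed program, then fix the program and re-solve the ratios, repeating
# until the estimated per-iteration cost stops improving.
def alternate_optimize(graph, device_speeds, synthesize_program, solve_ratios_lp,
                       max_rounds=10, tol=1e-3):
    # Start from ratios proportional to device speed.
    ratios = [s / sum(device_speeds) for s in device_speeds]
    program, best_cost = None, float("inf")
    for _ in range(max_rounds):
        # (a) Fix ratios; search sharding strategy + communication methods.
        program, _ = synthesize_program(graph, ratios)
        # (b) Fix the program; re-optimize the ratios with the linear cost model.
        ratios, cost = solve_ratios_lp(program, device_speeds)
        if best_cost - cost < tol:  # converged: no meaningful improvement
            break
        best_cost = cost
    return program, ratios, best_cost
```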