April 27-May 1, 2024 | Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
The paper introduces T3, a novel approach that transparently overlaps serialized communication with computation in distributed machine learning, focusing on Tensor Parallelism (TP). T3 addresses the challenges of fine-grained overlap through hardware-software co-design, combining a lightweight track-and-trigger mechanism with compute-enhanced memories. This minimizes resource contention and efficiently overlaps communication with computation, yielding significant speedups in the communication-heavy sublayers of large language models. For models like T-NLG, T3 achieves a 30% geometric-mean speedup (max 47%) and reduces data movement by a geometric mean of 22% (max 36%). The benefits persist as models scale, with a 29% geometric-mean speedup for sublayers in ~500-billion-parameter models. T3's effectiveness is demonstrated through simulations and real-world experiments, showing improved performance in both training and inference.
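To make the overlap concrete, here is a minimal software analogy of what T3 achieves transparently in hardware: a tensor-parallel rank splits its row-parallel GEMM into chunks and launches each chunk's all-reduce asynchronously, so communication of finished chunks overlaps with the compute of remaining ones. The function name, chunk count, and shapes are illustrative assumptions, not from the paper; T3 itself performs the tracking and triggering in hardware with near-memory reductions rather than in framework code.

```python
# Sketch only: assumes torch.distributed is already initialized
# (e.g., launched via torchrun with the NCCL backend).
import torch
import torch.distributed as dist

def chunked_row_parallel_gemm(x, w, num_chunks=4):
    """Partial GEMM on this TP rank; each output chunk's all-reduce is
    issued asynchronously so it overlaps with later chunks' compute."""
    outputs, handles = [], []
    for x_chunk in x.chunk(num_chunks, dim=0):
        y = x_chunk @ w                                     # partial result for this chunk
        handles.append(dist.all_reduce(y, async_op=True))   # start its reduction immediately
        outputs.append(y)
    for h in handles:                                        # wait only after all chunks are in flight
        h.wait()
    return torch.cat(outputs, dim=0)
```

Unlike this software-level pipelining, which pays extra kernel launches and contends with the GEMM for compute units and memory bandwidth, T3 (per the paper) tracks the producing kernel's writes and triggers communication as data becomes ready, offloading reductions to compute-enhanced memory to sidestep that contention.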