April 27-May 1, 2024 | Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
T3 is a hardware-software co-designed solution that enables fine-grained overlap of serialized communication with compute, minimizing resource contention. It transparently fuses producer operations with subsequent communication by configuring the producer's output address space to initiate communication directly on the producer's store, requiring minimal application changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in ~500-billion parameter models, PALM and MT-NLG.
T3 addresses the challenges of fine-grained compute-communication overlap by using a lightweight and programmable hardware tracker that tracks producer/communication progress and triggers communication using pre-programmed DMA commands, requiring no additional GPU compute resources for communication. Furthermore, to reduce contention for memory bandwidth between the producer and communication, T3 leverages recently proposed compute-enhanced memories to atomically update memory on stores, thus reducing memory traffic due to communication-related reductions. Finally, T3 employs a simple yet effective arbitration policy between the producer and communication memory streams to minimize any remaining contention. Overall, T3 transparently overlaps serialized communication with minimal resource contention. This improves compute and network utilization and, in turn, can enable better throughput scaling with increasing device count.
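The track-and-trigger idea above can be sketched in software. This is a minimal, illustrative model only (the names `Tracker` and `DmaCommand` are hypothetical, not T3's actual hardware interface): a tracker counts producer stores into each output chunk and, once a chunk is fully written, fires that chunk's pre-programmed DMA command, so communication begins without consuming GPU compute resources.

```python
# Toy software model of T3-style "track & trigger" logic (illustrative only;
# the real mechanism is a lightweight hardware tracker next to the memory path).
from dataclasses import dataclass, field

@dataclass
class DmaCommand:
    """A pre-programmed communication command for one chunk of producer output."""
    chunk_id: int
    src_addr: int
    num_bytes: int

@dataclass
class Tracker:
    """Counts producer store bytes per chunk; triggers the chunk's DMA when full."""
    chunk_bytes: int
    commands: dict                                 # chunk_id -> DmaCommand
    progress: dict = field(default_factory=dict)   # chunk_id -> bytes stored so far
    issued: list = field(default_factory=list)     # DMA commands triggered so far

    def on_store(self, addr: int, nbytes: int) -> None:
        chunk = addr // self.chunk_bytes
        self.progress[chunk] = self.progress.get(chunk, 0) + nbytes
        # As soon as this chunk is fully produced, launch its communication,
        # overlapping it with the producer's remaining compute.
        if self.progress[chunk] == self.chunk_bytes:
            self.issued.append(self.commands[chunk])

# Example: two 256-byte chunks, stores arrive in 64-byte granules.
cmds = {i: DmaCommand(i, i * 256, 256) for i in range(2)}
tracker = Tracker(chunk_bytes=256, commands=cmds)
for addr in range(0, 256, 64):
    tracker.on_store(addr, 64)
# Chunk 0 is complete here, so its DMA command has been issued while
# the producer would still be computing chunk 1.
```

The key property mirrored from the text: triggering is driven entirely by observing the producer's stores, so no GPU threads are spent polling or launching communication.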
T3's key contributions include enabling fine-grained overlap of serialized communication with its producer computation while lowering application impact and managing compute and memory interference. To manage application impact, T3 configures the producer's output address space mapping to initiate communication on stores, requiring only minor modifications to the producer kernels. To manage compute resource contention, T3 uses a lightweight programmable tracker that tracks producer progress and triggers communication using existing DMA engines, requiring no additional compute resources. Finally, to tackle memory bandwidth contention between computation and communication, T3 harnesses emerging near-memory compute technology to reduce data movement due to communication. T3 also devises a simple yet effective memory controller arbitration policy to better interleave computation and communication memory traffic.
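The arbitration idea can also be sketched with a toy policy. This is an assumed, illustrative scheme (the `compute_share` ratio and the function itself are hypothetical, not T3's documented policy): the memory controller serves a fixed number of producer requests for every communication request, so overlapped communication traffic cannot starve the producer's compute stream.

```python
# Toy memory-controller arbitration between the producer (compute) request
# stream and the communication request stream (illustrative policy only).
from collections import deque

def arbitrate(compute_q: deque, comm_q: deque, compute_share: int = 3) -> list:
    """Serve up to `compute_share` compute requests per communication request.

    Returns the order in which requests reach memory. A higher
    `compute_share` biases bandwidth toward the producer kernel.
    """
    order = []
    while compute_q or comm_q:
        # Favor the producer: drain up to `compute_share` compute requests.
        for _ in range(compute_share):
            if compute_q:
                order.append(compute_q.popleft())
        # Then admit one communication request, keeping it making progress.
        if comm_q:
            order.append(comm_q.popleft())
    return order

# Example: 4 compute requests interleaved 2:1 with 2 communication requests.
served = arbitrate(deque(["c1", "c2", "c3", "c4"]),
                   deque(["m1", "m2"]), compute_share=2)
# served == ["c1", "c2", "m1", "c3", "c4", "m2"]
```

The design point this illustrates: communication requests are admitted steadily rather than in bursts, which is what lets the overlapped streams share memory bandwidth with bounded interference.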
T3 extends Accel-Sim to accurately model multi-GPU systems (6% error). Our results show that T3 speeds up sliced Transformer sub-layers from models like Mega-GPT-2 and T-NLG by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sub-layers in ~500-billion parameter models, PALM and MT-NLG.