June 21, 2024 | Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu
Flux is a novel method for hiding communication latency in GPU-based deep learning training and inference. It over-decomposes communication and computation into finer-grained operations and fuses them into a single larger kernel, effectively hiding communication without compromising kernel efficiency. Flux can overlap up to 96% of communication time with computation in a fused kernel. It achieves up to 1.24x training speedups over Megatron-LM on 128-GPU clusters, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on 8-GPU clusters.
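To make the idea concrete, here is a minimal CUDA sketch of the overdecompose-and-fuse pattern, assuming an NVSHMEM-capable cluster; it is not the paper's CUTLASS-based implementation. Each threadblock computes one output tile with a plain dot product (no tensor cores) and, in its epilogue, immediately issues a non-blocking NVSHMEM put of that tile to the owning GPU, so later tiles' computation overlaps earlier tiles' transfers. For brevity it scatters finished tiles rather than performing the paper's fused ReduceScatter, and names such as `fused_gemm_scatter` and `rows_per_pe` are illustrative.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

#define TILE 32  // tile edge; production kernels tune this per architecture

// Each threadblock computes one TILE x TILE tile of C = A * B (row-major,
// m x k by k x n), then pushes the finished tile to the PE that owns its
// rows, overlapping the transfer with other blocks' remaining math.
__global__ void fused_gemm_scatter(const float *A, const float *B,
                                   float *C_stage,  // local m x n staging
                                   float *C_out,    // symmetric output buffer
                                   int m, int n, int k, int rows_per_pe) {
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  if (row < m && col < n) {
    float acc = 0.f;
    for (int i = 0; i < k; ++i) acc += A[row * k + i] * B[i * n + col];
    C_stage[row * n + col] = acc;
  }
  __syncthreads();
  __threadfence();  // make the finished tile visible before it is sent

  // Fused epilogue: the whole block cooperates in non-blocking puts of its
  // tile rows (assumes TILE divides rows_per_pe, so one tile -> one PE).
  int tile_row = blockIdx.y * TILE, tile_col = blockIdx.x * TILE;
  int dest_pe = tile_row / rows_per_pe;
  int cols = min(TILE, n - tile_col);
  for (int r = tile_row; r < min(tile_row + TILE, m); ++r)
    nvshmemx_float_put_nbi_block(
        C_out + (size_t)(r % rows_per_pe) * n + tile_col,
        C_stage + (size_t)r * n + tile_col, cols, dest_pe);
}

int main() {
  nvshmem_init();
  int npes = nvshmem_n_pes();
  const int m = 1024, n = 1024, k = 1024, rows_per_pe = m / npes;
  float *A, *B, *C_stage;  // inputs left uninitialized: structure only
  cudaMalloc(&A, sizeof(float) * m * k);
  cudaMalloc(&B, sizeof(float) * k * n);
  cudaMalloc(&C_stage, sizeof(float) * m * n);
  float *C_out = (float *)nvshmem_malloc(sizeof(float) * rows_per_pe * n);
  dim3 block(TILE, TILE), grid(n / TILE, m / TILE);
  fused_gemm_scatter<<<grid, block>>>(A, B, C_stage, C_out, m, n, k,
                                      rows_per_pe);
  nvshmemx_barrier_all_on_stream(0);  // barrier carries quiet semantics:
  cudaStreamSynchronize(0);           // all puts finish before C_out is read
  nvshmem_free(C_out);
  nvshmem_finalize();
  return 0;
}
```

Because the NVSHMEM barrier includes quiet semantics, all outstanding non-blocking puts are complete before any rank reads `C_out`; the overlap comes entirely from puts being issued per-tile inside the kernel rather than in a separate communication kernel afterward.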
The paper introduces Flux, which improves communication overlapping through kernel fusion and fine-grained decomposition. Flux co-optimizes communication with computation via a set of techniques: kernel fusion, tile coordinate swizzling, GPU instruction selection, and communication order selection, which together let it adapt to different GPU architectures and interconnects. It is implemented on top of NVIDIA CUTLASS and can be easily auto-tuned across those architectures and interconnects.
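As one concrete illustration of these techniques, below is a hypothetical sketch of how tile coordinate swizzling can encode a communication-friendly schedule; the rotation policy and the name `swizzle_tile` are assumptions for illustration, not the paper's exact schedule.

```cuda
// Map a block's linear tile id to swizzled (tile_row, tile_col) coordinates.
// Output tile rows are partitioned across `world` ranks; rotating the row
// index so each rank starts at its successor's partition means remote-bound
// tiles finish first, and their transfers overlap the remaining local math.
__host__ __device__ inline void swizzle_tile(int tid, int tiles_m, int tiles_n,
                                             int rank, int world,
                                             int *tile_row, int *tile_col) {
  int tiles_per_rank = tiles_m / world;  // assumes world divides tiles_m
  *tile_col = tid % tiles_n;
  *tile_row = ((tid / tiles_n) + (rank + 1) * tiles_per_rank) % tiles_m;
}
```

A fused kernel would call this with `tid = blockIdx.y * gridDim.x + blockIdx.x` instead of using block coordinates directly; varying the rotation offset per rank and interconnect is one way communication order selection could be realized.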
Flux is implemented with CUTLASS 3.4.1 and NVSHMEM 2.10.1 and evaluated on three clusters: A100 PCIe, A100 NVLink, and H800 NVLink. It delivers significant speedups and overlap efficiency over existing methods such as TransformerEngine and Megatron-LM in both operation-level and model-level evaluations, particularly for larger GEMM m dimensions and workloads with higher communication proportions. It also remains robust in less favorable scenarios, including small m dimensions and the decoding phase of inference.
Flux is a communication overlapping solution that composes with accelerated collective communication and communication compression techniques. It is also compatible with ZeRO sharding and applies to the communication of activations, weights, and gradients. The paper concludes that Flux is a crucial technique for running large deep learning models with tensor parallelism, as it significantly reduces exposed communication time and improves system FLOPS utilization.