June 21, 2024 | Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu
Flux is a novel method for hiding communication latency in GPU-based deep learning training and inference. It over-decomposes communication and computation into finer-grained operations and fuses them into a single larger kernel, effectively hiding communication without compromising kernel efficiency. Flux can overlap up to 96% of communication time with computation in a fused kernel. It achieves up to 1.24x training speedups over Megatron-LM on 128-GPU clusters, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on 8-GPU clusters.
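To make the idea concrete, here is a minimal CUDA sketch of the overdecompose-and-fuse pattern, assuming an NVSHMEM-capable cluster; it is not the paper's CUTLASS-based implementation. Each threadblock computes one output tile with a plain dot product (no tensor cores) and, in its epilogue, immediately issues a non-blocking NVSHMEM put of that tile to the owning GPU, so later tiles' computation overlaps earlier tiles' transfers. For brevity it scatters finished tiles rather than performing the paper's fused ReduceScatter, and names such as `fused_gemm_scatter` and `rows_per_pe` are illustrative.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

#define TILE 32  // tile edge; production kernels tune this per architecture

// Each threadblock computes one TILE x TILE tile of C = A * B (row-major,
// m x k by k x n), then pushes the finished tile to the PE that owns its
// rows, overlapping the transfer with other blocks' remaining math.
__global__ void fused_gemm_scatter(const float *A, const float *B,
                                   float *C_stage,  // local m x n staging
                                   float *C_out,    // symmetric output buffer
                                   int m, int n, int k, int rows_per_pe) {
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  if (row < m && col < n) {
    float acc = 0.f;
    for (int i = 0; i < k; ++i) acc += A[row * k + i] * B[i * n + col];
    C_stage[row * n + col] = acc;
  }
  __syncthreads();
  __threadfence();  // make the finished tile visible before it is sent

  // Fused epilogue: the whole block cooperates in non-blocking puts of its
  // tile rows (assumes TILE divides rows_per_pe, so one tile -> one PE).
  int tile_row = blockIdx.y * TILE, tile_col = blockIdx.x * TILE;
  int dest_pe = tile_row / rows_per_pe;
  int cols = min(TILE, n - tile_col);
  for (int r = tile_row; r < min(tile_row + TILE, m); ++r)
    nvshmemx_float_put_nbi_block(
        C_out + (size_t)(r % rows_per_pe) * n + tile_col,
        C_stage + (size_t)r * n + tile_col, cols, dest_pe);
}

int main() {
  nvshmem_init();
  int npes = nvshmem_n_pes();
  const int m = 1024, n = 1024, k = 1024, rows_per_pe = m / npes;
  float *A, *B, *C_stage;  // inputs left uninitialized: structure only
  cudaMalloc(&A, sizeof(float) * m * k);
  cudaMalloc(&B, sizeof(float) * k * n);
  cudaMalloc(&C_stage, sizeof(float) * m * n);
  float *C_out = (float *)nvshmem_malloc(sizeof(float) * rows_per_pe * n);
  dim3 block(TILE, TILE), grid(n / TILE, m / TILE);
  fused_gemm_scatter<<<grid, block>>>(A, B, C_stage, C_out, m, n, k,
                                      rows_per_pe);
  nvshmemx_barrier_all_on_stream(0);  // barrier carries quiet semantics:
  cudaStreamSynchronize(0);           // all puts finish before C_out is read
  nvshmem_free(C_out);
  nvshmem_finalize();
  return 0;
}
```

Because the NVSHMEM barrier includes quiet semantics, all outstanding non-blocking puts are complete before any rank reads `C_out`; the overlap comes entirely from puts being issued per-tile inside the kernel rather than in a separate communication kernel afterward.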
The paper introduces Flux, which improves communication overlapping through kernel fusion and fine-grained decomposition. Flux co-optimizes communication with computation via a set of techniques: kernel fusion, tile coordinate swizzling, GPU instruction selection, and communication order selection, which together let it adapt to different GPU architectures and interconnects. It is implemented on top of NVIDIA CUTLASS and can be easily auto-tuned across those architectures and interconnects.
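As one concrete illustration of these techniques, below is a hypothetical sketch of how tile coordinate swizzling can encode a communication-friendly schedule; the rotation policy and the name `swizzle_tile` are assumptions for illustration, not the paper's exact schedule.

```cuda
// Map a block's linear tile id to swizzled (tile_row, tile_col) coordinates.
// Output tile rows are partitioned across `world` ranks; rotating the row
// index so each rank starts at its successor's partition means remote-bound
// tiles finish first, and their transfers overlap the remaining local math.
__host__ __device__ inline void swizzle_tile(int tid, int tiles_m, int tiles_n,
                                             int rank, int world,
                                             int *tile_row, int *tile_col) {
  int tiles_per_rank = tiles_m / world;  // assumes world divides tiles_m
  *tile_col = tid % tiles_n;
  *tile_row = ((tid / tiles_n) + (rank + 1) * tiles_per_rank) % tiles_m;
}
```

A fused kernel would call this with `tid = blockIdx.y * gridDim.x + blockIdx.x` instead of using block coordinates directly; varying the rotation offset per rank and interconnect is one way communication order selection could be realized.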
Flux is implemented with CUTLASS 3.4.1 and NVSHMEM 2.10.1 and evaluated on three clusters: A100 PCIe, A100 NVLink, and H800 NVLink. It delivers significant speedups and overlap efficiency over existing methods such as TransformerEngine and Megatron-LM in both operation-level and model-level evaluations, particularly for larger GEMM m dimensions and workloads with higher communication proportions. It also remains robust in less favorable scenarios, including small m dimensions and the decoding phase of inference.
Flux is a communication overlapping solution that composes with accelerated collective communication and communication compression techniques. It is also compatible with ZeRO sharding and applies to the communication of activations, weights, and gradients. The paper concludes that Flux is a crucial technique for running large deep learning models with tensor parallelism, as it significantly reduces exposed communication time and improves system FLOPS utilization.