DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

14 Jul 2024 | Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han
DistriFusion is a training-free algorithm that accelerates diffusion model inference across multiple GPUs without sacrificing image quality, addressing the high computational cost and latency of generating high-resolution images with diffusion models. The method splits the image into patches and assigns each patch to a different GPU, enabling parallel processing. Naive parallelization, however, produces visible seams because the patches do not interact, while adding synchronous interaction between patches incurs heavy communication overhead.

DistriFusion resolves this tension with displaced patch parallelism. The key observation is that inputs across adjacent denoising steps are similar, so pre-computed activations from the previous step can be reused to provide context for the current step. Because this context is already available, the required communication becomes asynchronous and can be pipelined into the computation, hiding its cost. This also exposes a new parallelization opportunity: the sequential nature of diffusion models lets communication and computation overlap.

Applied to the recent Stable Diffusion XL model, DistriFusion reduces computation proportionally to the number of devices and achieves up to a 6.1× speedup on eight A100 GPUs compared to one, while maintaining image quality. It reduces the latency of the SDXL U-Net for generating a single image by up to 1.8×, 3.4×, and 6.1× with 2, 4, and 8 A100 GPUs, respectively; combined with batch splitting for classifier-free guidance, it achieves 3.6× and 6.6× speedups for 3840×3840 images.
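The core idea of displaced patch parallelism can be illustrated with a minimal single-process sketch. The `denoise_patch` stand-in and the one-row halo handling below are hypothetical simplifications, not the paper's implementation: each worker denoises its own fresh patch, while the boundary context is taken from neighbour activations cached at the *previous* denoising step, so no synchronous exchange is needed within a step.

```python
import numpy as np

def denoise_patch(patch_with_context: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for one U-Net denoising pass on a patch (hypothetical
    placeholder computation, not a real diffusion model)."""
    return patch_with_context * 0.9

def displaced_patch_step(patches, stale_activations, step):
    """One denoising step under displaced patch parallelism.

    Each 'device' (list index) denoises its own fresh patch; the
    surrounding context rows come from activations computed at the
    previous step (stale_activations), mimicking the reuse of slightly
    stale features described in the paper.
    """
    new_patches = []
    for i, patch in enumerate(patches):
        # Borrow a one-row halo from the neighbours' previous-step
        # activations; replicate the edge at the image boundary.
        left = stale_activations[i - 1] if i > 0 else patch
        right = stale_activations[i + 1] if i < len(patches) - 1 else patch
        context = np.concatenate([left[-1:], patch, right[:1]], axis=0)
        # Denoise with context attached, then drop the borrowed halo rows.
        out = denoise_patch(context, step)[1:-1]
        new_patches.append(out)
    return new_patches
```

The caller would then store `new_patches` as the stale-activation cache for the next step, so each step only ever reads features that already exist.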
DistriFusion is applicable to a majority of few-step samplers and requires only off-the-shelf pre-trained diffusion models. Benchmarked on a subset of COCO Captions, it matches the performance of the original Stable Diffusion XL while reducing computation proportionally to the number of devices. By enabling efficient parallelization of diffusion models, the method reduces latency while preserving image quality, making it suitable for a wide range of diffusion models and for real-time applications.
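Because the context for each step is the previous step's activations, the exchange of those activations can run in the background while the current step computes. The sketch below mimics this overlap with a single worker thread standing in for an async collective (the function names and the simulated "all-gather" are hypothetical, not the paper's actual NCCL-based pipeline):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def compute_patch(patch: np.ndarray) -> np.ndarray:
    """Stand-in for the per-device U-Net work on one patch (hypothetical)."""
    return patch * 0.9

def exchange_activations(activations):
    """Stand-in for gathering this step's activations; in a real multi-GPU
    setup this would be an asynchronous collective on a separate stream."""
    return [a.copy() for a in activations]

def pipelined_steps(patches, num_steps):
    """Overlap communication with computation: while step t's patches are
    being denoised, the previous step's activations are gathered in the
    background so they are ready as context for the next step."""
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Kick off the first gather before any compute waits on it.
        gather = comm.submit(exchange_activations, patches)
        for _ in range(num_steps):
            # Computation proceeds without blocking on fresh neighbour data.
            patches = [compute_patch(p) for p in patches]
            stale_context = gather.result()  # previous gather has finished
            gather = comm.submit(exchange_activations, patches)
        gather.result()
    return patches
```

In this toy version `stale_context` is only collected, not consumed; its role is to show that the gather of step t-1 completes under the compute of step t, which is what hides the communication cost.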
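The batch splitting mentioned above refers to running the two branches of classifier-free guidance concurrently instead of as one batch of two. A minimal sketch, assuming a hypothetical `unet` stand-in and an illustrative guidance scale (the thread pool stands in for two device groups):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

GUIDANCE_SCALE = 7.5  # illustrative value; the actual scale is a user choice

def unet(latent: np.ndarray, conditional: bool) -> np.ndarray:
    """Stand-in for one U-Net forward pass (hypothetical placeholder)."""
    return latent * (0.5 if conditional else 0.25)

def cfg_batch_split(latent: np.ndarray) -> np.ndarray:
    """Run the conditional and unconditional branches concurrently,
    mimicking batch splitting across two device groups, then combine
    them with the standard classifier-free guidance formula."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(unet, latent, True)
        uncond_future = pool.submit(unet, latent, False)
        cond, uncond = cond_future.result(), uncond_future.result()
    # uncond + scale * (cond - uncond): the usual CFG combination.
    return uncond + GUIDANCE_SCALE * (cond - uncond)
```

Splitting the guidance batch this way composes with patch parallelism, which is how the reported 3.6× and 6.6× end-to-end speedups for 3840×3840 images are obtained.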