DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

14 Jul 2024 | Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han
DistriFusion is a training-free algorithm that accelerates diffusion model inference across multiple GPUs without sacrificing image quality, addressing the high computational cost and latency of generating high-resolution images with diffusion models. The method splits the image into patches and assigns each patch to a different GPU, enabling parallel processing. Naive parallelization, however, produces visible seams because the patches do not interact, while adding synchronous interaction between patches incurs heavy communication overhead.

DistriFusion resolves this tension with displaced patch parallelism. The key observation is that inputs across adjacent denoising steps are similar, so pre-computed activations from the previous step can be reused to provide context for the current step. Because this context is already available, the required communication becomes asynchronous and can be pipelined into the computation, hiding its cost. This also exposes a new parallelization opportunity: the sequential nature of diffusion models lets communication and computation overlap.

Applied to the recent Stable Diffusion XL model, DistriFusion reduces computation proportionally to the number of devices and achieves up to a 6.1× speedup on eight A100 GPUs compared to one, while maintaining image quality. It reduces the latency of the SDXL U-Net for generating a single image by up to 1.8×, 3.4×, and 6.1× with 2, 4, and 8 A100 GPUs, respectively; combined with batch splitting for classifier-free guidance, it achieves 3.6× and 6.6× speedups for 3840×3840 images.
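The core idea of displaced patch parallelism can be illustrated with a minimal single-process sketch. The `denoise_patch` stand-in and the one-row halo handling below are hypothetical simplifications, not the paper's implementation: each worker denoises its own fresh patch, while the boundary context is taken from neighbour activations cached at the *previous* denoising step, so no synchronous exchange is needed within a step.

```python
import numpy as np

def denoise_patch(patch_with_context: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for one U-Net denoising pass on a patch (hypothetical
    placeholder computation, not a real diffusion model)."""
    return patch_with_context * 0.9

def displaced_patch_step(patches, stale_activations, step):
    """One denoising step under displaced patch parallelism.

    Each 'device' (list index) denoises its own fresh patch; the
    surrounding context rows come from activations computed at the
    previous step (stale_activations), mimicking the reuse of slightly
    stale features described in the paper.
    """
    new_patches = []
    for i, patch in enumerate(patches):
        # Borrow a one-row halo from the neighbours' previous-step
        # activations; replicate the edge at the image boundary.
        left = stale_activations[i - 1] if i > 0 else patch
        right = stale_activations[i + 1] if i < len(patches) - 1 else patch
        context = np.concatenate([left[-1:], patch, right[:1]], axis=0)
        # Denoise with context attached, then drop the borrowed halo rows.
        out = denoise_patch(context, step)[1:-1]
        new_patches.append(out)
    return new_patches
```

The caller would then store `new_patches` as the stale-activation cache for the next step, so each step only ever reads features that already exist.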
DistriFusion is applicable to a majority of few-step samplers and requires only off-the-shelf pre-trained diffusion models. Benchmarked on a subset of COCO Captions, it matches the performance of the original Stable Diffusion XL while reducing computation proportionally to the number of devices. By enabling efficient parallelization of diffusion models, the method reduces latency while preserving image quality, making it suitable for a wide range of diffusion models and for real-time applications.
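Because the context for each step is the previous step's activations, the exchange of those activations can run in the background while the current step computes. The sketch below mimics this overlap with a single worker thread standing in for an async collective (the function names and the simulated "all-gather" are hypothetical, not the paper's actual NCCL-based pipeline):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def compute_patch(patch: np.ndarray) -> np.ndarray:
    """Stand-in for the per-device U-Net work on one patch (hypothetical)."""
    return patch * 0.9

def exchange_activations(activations):
    """Stand-in for gathering this step's activations; in a real multi-GPU
    setup this would be an asynchronous collective on a separate stream."""
    return [a.copy() for a in activations]

def pipelined_steps(patches, num_steps):
    """Overlap communication with computation: while step t's patches are
    being denoised, the previous step's activations are gathered in the
    background so they are ready as context for the next step."""
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Kick off the first gather before any compute waits on it.
        gather = comm.submit(exchange_activations, patches)
        for _ in range(num_steps):
            # Computation proceeds without blocking on fresh neighbour data.
            patches = [compute_patch(p) for p in patches]
            stale_context = gather.result()  # previous gather has finished
            gather = comm.submit(exchange_activations, patches)
        gather.result()
    return patches
```

In this toy version `stale_context` is only collected, not consumed; its role is to show that the gather of step t-1 completes under the compute of step t, which is what hides the communication cost.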
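The batch splitting mentioned above refers to running the two branches of classifier-free guidance concurrently instead of as one batch of two. A minimal sketch, assuming a hypothetical `unet` stand-in and an illustrative guidance scale (the thread pool stands in for two device groups):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

GUIDANCE_SCALE = 7.5  # illustrative value; the actual scale is a user choice

def unet(latent: np.ndarray, conditional: bool) -> np.ndarray:
    """Stand-in for one U-Net forward pass (hypothetical placeholder)."""
    return latent * (0.5 if conditional else 0.25)

def cfg_batch_split(latent: np.ndarray) -> np.ndarray:
    """Run the conditional and unconditional branches concurrently,
    mimicking batch splitting across two device groups, then combine
    them with the standard classifier-free guidance formula."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(unet, latent, True)
        uncond_future = pool.submit(unet, latent, False)
        cond, uncond = cond_future.result(), uncond_future.result()
    # uncond + scale * (cond - uncond): the usual CFG combination.
    return uncond + GUIDANCE_SCALE * (cond - uncond)
```

Splitting the guidance batch this way composes with patch parallelism, which is how the reported 3.6× and 6.6× end-to-end speedups for 3840×3840 images are obtained.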