12 Jun 2024 | Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
DiTFastAttn is a post-training compression method that reduces the computational cost of Diffusion Transformers (DiT) for image and video generation. DiT models are expensive to run because self-attention scales quadratically with sequence length, which becomes especially costly at high resolutions.

DiTFastAttn identifies and eliminates three types of redundancy in attention computation: spatial redundancy within the attention map, temporal similarity between neighboring denoising steps, and conditional redundancy between the conditional and unconditional inferences of classifier-free guidance (CFG). It introduces one technique for each: Window Attention with Residual Sharing (WA-RS) restricts attention to a local window and adds back a cached residual to recover long-range information; Attention Sharing across Timesteps (AST) reuses attention outputs from earlier steps when successive steps are similar; and Attention Sharing across CFG (ASC) skips the redundant attention computation in the unconditional branch during conditional generation. Together, these techniques significantly reduce floating-point operations (FLOPs) and improve inference speed.

Evaluated on DiT, PixArt-Sigma, and OpenSora, DiTFastAttn reduces FLOPs by up to 88% and achieves up to a 1.6x speedup for high-resolution image generation while maintaining high-quality outputs, even under significant compression. The approach is compatible with various sampling methods and has been tested across resolutions and on video generation tasks, making it suitable for deployment in resource-constrained environments.
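To make two of these mechanisms concrete, here is a minimal PyTorch sketch of WA-RS and ASC. Everything in it is illustrative: the `window_attention` helper, the class name, and the caching policy are assumptions for exposition, not the paper's implementation (which builds on optimized attention kernels).

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, window: int):
    """Attention restricted to a local window around each query position.
    Toy O(n^2) version via masking; a real kernel would skip masked blocks.
    q, k, v: (batch, heads, seq_len, head_dim)."""
    n = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    idx = torch.arange(n, device=q.device)
    outside = (idx[None, :] - idx[:, None]).abs() > window  # True = outside window
    scores = scores.masked_fill(outside, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

class WindowAttnResidualSharing:
    """Sketch of WA-RS: at a designated step, compute full attention and cache
    the (full - window) residual; at subsequent steps, compute only the cheap
    window attention and add the stale residual back."""

    def __init__(self, window: int):
        self.window = window
        self.residual = None  # cached full-minus-window output

    def full_step(self, q, k, v):
        # Expensive step: full attention, and refresh the cached residual.
        full = F.scaled_dot_product_attention(q, k, v)
        self.residual = full - window_attention(q, k, v, self.window)
        return full

    def shared_step(self, q, k, v):
        # Cheap step: local window attention plus the cached long-range residual.
        return window_attention(q, k, v, self.window) + self.residual

def attention_sharing_cfg(attn_fn, q, k, v):
    """Sketch of ASC: with CFG, the batch stacks [conditional; unconditional]
    inputs. Compute attention only for the conditional half and reuse it."""
    half = q.shape[0] // 2
    cond_out = attn_fn(q[:half], k[:half], v[:half])
    return torch.cat([cond_out, cond_out], dim=0)
```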
The method is implemented on top of FlashAttention-2 and validated through extensive experiments on multiple diffusion models. A greedy strategy decides, for each attention layer and each denoising step, which compression technique (if any) to apply, trading compression against output quality; a sketch of this selection loop follows below. The results show that DiTFastAttn delivers substantial gains in computational efficiency while preserving generation quality, making it a valuable tool for accelerating diffusion models.
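The following is a hypothetical sketch of that greedy selection loop. The function names, the relative-error criterion, and the threshold value are assumptions for illustration; the paper's actual procedure calibrates each layer and step against the full-attention output but may use a different metric.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Reference full attention that the compressed variants approximate."""
    return F.scaled_dot_product_attention(q, k, v)

def select_compression_plan(calib_inputs, candidates, threshold=0.05):
    """Greedy per-(step, layer) selection: try candidates from most to least
    aggressive and keep the first whose output stays close to full attention.

    calib_inputs: dict mapping (step, layer) -> (q, k, v) captured during a
                  calibration run.
    candidates:   list of (name, attn_fn) ordered most-compressed first, where
                  attn_fn(q, k, v) returns an attention output.
    """
    plan = {}
    for (step, layer), (q, k, v) in calib_inputs.items():
        ref = full_attention(q, k, v)
        plan[(step, layer)] = "full"  # fallback: no compression
        for name, attn_fn in candidates:
            out = attn_fn(q, k, v)
            rel_err = (out - ref).norm() / ref.norm()
            if rel_err <= threshold:
                plan[(step, layer)] = name
                break
    return plan
```

In DiTFastAttn's setting, the candidate list would contain ASC, AST, WA-RS, and their combinations, so the cheapest acceptable variant wins at every layer and step.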