**DiTFastAttn: Attention Compression for Diffusion Transformer Models**
**Authors:** Zhihang Yuan et al.
**Affiliations:** Tsinghua University, Infinigence AI, Carnegie Mellon University, Shanghai Jiao Tong University
**Abstract:**
Diffusion Transformers (DiT) are highly effective in image and video generation but suffer from computational bottlenecks due to the quadratic complexity of self-attention. This paper introduces DiTFastAttn, a post-training compression method to address this issue. The method identifies three key redundancies in DiT inference: spatial redundancy, temporal redundancy, and conditional redundancy. To tackle these redundancies, DiTFastAttn proposes three techniques: Window Attention with Residual Caching, Temporal Similarity Reduction, and Conditional Redundancy Elimination. Extensive experiments on DiT, PixArt-Sigma, and OpenSora models demonstrate that DiTFastAttn reduces up to 88% of the attention FLOPs and achieves a speedup of up to 1.6x in high-resolution image generation.
**Introduction:**
Diffusion Transformers (DiT) are popular for image and video generation but face significant computational challenges, especially at high resolutions. Because self-attention scales quadratically with the number of tokens, its cost dominates inference as resolution grows. Previous efforts to accelerate attention typically require retraining, which is costly. DiTFastAttn instead addresses the problem post-training, by identifying and reducing redundancies in attention computation during inference.
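To make the quadratic scaling concrete, here is a small back-of-the-envelope estimate (not from the paper; the 8x VAE downsampling, 2x2 patchify, and head configuration are assumptions about a typical latent-space DiT) of per-layer attention FLOPs as resolution grows:

```python
# Hypothetical estimate of how self-attention FLOPs grow with image resolution
# for a latent-space DiT-style model. All parameters below are assumptions.

def attention_flops(num_tokens: int, head_dim: int, num_heads: int) -> int:
    """Approximate FLOPs of one self-attention layer: QK^T plus attn @ V."""
    d = head_dim * num_heads
    # 2 * N^2 * d for QK^T and another 2 * N^2 * d for softmax(QK^T) @ V
    return 4 * num_tokens * num_tokens * d

# Assume 8x VAE downsampling and 2x2 patchify, so an HxW image yields (H/16)*(W/16) tokens.
for side in (512, 1024, 2048):
    tokens = (side // 16) ** 2
    flops = attention_flops(tokens, head_dim=64, num_heads=16)
    print(f"{side}x{side}: {tokens} tokens, ~{flops / 1e9:.1f} GFLOPs per attention layer")
```

Each doubling of resolution quadruples the token count and thus increases attention cost roughly sixteenfold, which is why attention becomes the bottleneck at high resolutions.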
**Related Work:**
The paper reviews existing methods for diffusion models and vision transformer compression, including local attention, attention sharing, and other techniques that aim to reduce computational overhead.
**Method:**
DiTFastAttn introduces three main techniques:
1. **Window Attention with Residual Sharing (WA-RS):** Reduces spatial redundancy by computing attention within a local window and caching the residual between the full-attention and window-attention outputs, which is reused at later timesteps.
2. **Attention Sharing across Timesteps (AST):** Exploits the similarity of attention outputs between neighboring denoising steps, reusing a cached output to skip the attention computation at some steps.
3. **Attention Sharing across CFG (ASC):** Exploits the similarity between the conditional and unconditional branches of classifier-free guidance, reusing the conditional attention output for the unconditional pass (see the sketch after this list).
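The following is a minimal sketch, not the authors' implementation, of how the three techniques can be combined in one attention layer. `FastAttnLayer`, `window_attention`, the per-step `strategy` labels, and the 1D sliding window (the paper's window is a 2D spatial one) are simplifying assumptions; the paper additionally selects a per-layer, per-step strategy with a post-training search under a quality threshold, which is omitted here.

```python
# Minimal sketch of WA-RS, AST, and ASC in a single attention layer (hypothetical names).
import torch
import torch.nn.functional as F

def window_attention(q, k, v, window: int):
    """Local attention: each query attends only to keys within `window` positions.
    A 1D band is used for simplicity; the paper's window is spatial/2D."""
    n = q.shape[-2]
    idx = torch.arange(n, device=q.device)
    keep = (idx[None, :] - idx[:, None]).abs() <= window  # True = attend
    return F.scaled_dot_product_attention(q, k, v, attn_mask=keep)

class FastAttnLayer:
    def __init__(self, window: int = 128):
        self.window = window
        self.residual_cache = None  # WA-RS: full-attention output minus window-attention output
        self.output_cache = None    # AST: attention output from the previous timestep

    def forward(self, q, k, v, strategy: str):
        if strategy == "ast":
            # AST: this step's attention is similar enough to the previous one -> reuse it.
            return self.output_cache
        if strategy == "full":
            # Full attention; also refresh the long-range residual used by WA-RS later.
            out = F.scaled_dot_product_attention(q, k, v)
            self.residual_cache = out - window_attention(q, k, v, self.window)
        else:  # strategy == "wa_rs"
            # WA-RS: cheap window attention plus the cached residual
            # (assumes a "full" step has already run for this layer).
            out = window_attention(q, k, v, self.window) + self.residual_cache
        self.output_cache = out
        return out

    def forward_cfg(self, q, k, v, strategy: str, share_cfg: bool):
        # ASC: with classifier-free guidance the batch holds [conditional; unconditional]
        # halves; when their attention outputs are similar, compute only the conditional
        # half and reuse it for the unconditional half.
        if share_cfg:
            half = q.shape[0] // 2
            cond = self.forward(q[:half], k[:half], v[:half], strategy)
            return torch.cat([cond, cond], dim=0)
        return self.forward(q, k, v, strategy)
```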
**Experiments:**
DiTFastAttn is evaluated on DiT, PixArt-Sigma, and OpenSora models for image and video generation tasks. Results show significant reductions in FLOPs and latency, with minimal impact on generation quality. The method is particularly effective at higher resolutions, achieving up to an 88% reduction in attention FLOPs and a 1.6x speedup for 2048x2048 image generation.
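As a rough illustration of where such large reductions can come from, the arithmetic below compares full and window attention FLOPs per layer. The token count, hidden size, and window size are assumptions rather than the paper's configuration, and the reported 88% figure also includes savings from AST and ASC.

```python
# Assumed numbers: ~16k tokens for a 2048x2048 image in latent space
# (8x VAE downsampling, 2x2 patchify), hidden size 1152, and a window
# covering one quarter of the keys. Illustrates scaling only.

def full_attention_flops(n: int, d: int) -> int:
    return 4 * n * n * d          # QK^T plus attn @ V

def window_attention_flops(n: int, d: int, window: int) -> int:
    return 4 * n * window * d     # each query attends to ~`window` keys

n, d, window = 16384, 1152, 4096
full = full_attention_flops(n, d)
win = window_attention_flops(n, d, window)
print(f"full: {full / 1e12:.2f} TFLOPs, window: {win / 1e12:.2f} TFLOPs "
      f"({1 - win / full:.0%} fewer per layer)")
```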
**Conclusion:**
DiTFastAttn effectively compresses DiT models, reducing computational costs and improving efficiency without significant performance degradation. Future work will focus on training-aware compression methods and extending the approach to other modules.