28 May 2024 | Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang
DiG (Diffusion Gated Linear Attention Transformers) is a scalable, efficient diffusion model that leverages Gated Linear Attention (GLA) Transformers for visual generation. It follows the DiT design with minimal parameter overhead while offering superior efficiency and effectiveness: DiG-S/2 trains 2.5× faster than DiT-S/2 and saves 75.7% GPU memory at a resolution of 1792×1792.
DiG scales across a range of computational complexities and outperforms other subquadratic-time diffusion models, including Mamba-based models and DiT with CUDA-optimized FlashAttention-2, making it well suited to large-scale, long-sequence generation tasks. To address the limitations of unidirectional scanning and weak local awareness, DiG incorporates a lightweight spatial reorient & enhancement module (SREM) and uses depth-wise convolution to provide local awareness with minimal added parameters. Evaluated on ImageNet, DiG outperforms DiT while being more efficient in training speed and GPU memory, positioning it as a promising next-generation backbone for diffusion models. The paper presents an extensive analysis of DiG's performance, efficiency, and scalability in visual generation tasks.
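To make the efficiency claim concrete, the core idea behind gated linear attention can be sketched as a recurrence: a matrix-valued state is decayed by a learned gate and updated with a rank-1 outer product at each step, so the cost is linear in sequence length instead of quadratic. The sketch below is a generic single-head GLA recurrence in NumPy, not the paper's actual implementation; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def gla_recurrence(q, k, v, alpha):
    """Minimal gated linear attention recurrence (single head, illustrative).

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    alpha: (T, d_k) per-key decay gates in (0, 1).
    The state S is a (d_k, d_v) matrix updated in O(d_k * d_v) per step,
    so the whole scan is linear in sequence length T, unlike the
    quadratic cost of softmax attention.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        # Gate decays each row of the state, then a rank-1 outer-product
        # update writes the current key/value pair into the state.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

With all gates fixed at 1 this reduces to plain (ungated) linear attention, i.e. each output equals the query applied to the running sum of key-value outer products.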
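The summary also notes that depth-wise convolution supplies local awareness with few parameters. The reason is that a depth-wise convolution applies one small kernel per channel rather than mixing channels, so its parameter count is C·kh·kw instead of C_in·C_out·kh·kw. A minimal NumPy sketch of this operation (illustrative only, with assumed 'same' padding and unit stride):

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Depth-wise 2D convolution: one kernel per channel, 'same' padding.

    x: (C, H, W) feature map; w: (C, kh, kw) per-channel kernels.
    Each channel is filtered independently, so parameters scale as
    C * kh * kw -- far cheaper than a full convolution's
    C_in * C_out * kh * kw, which is why it adds local mixing
    at minimal parameter cost.
    """
    C, H, W = x.shape
    _, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                # Correlate the c-th kernel with the c-th channel only.
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * w[c])
    return out
```

In a deep-learning framework this corresponds to a grouped convolution with as many groups as channels (e.g. PyTorch's `nn.Conv2d` with `groups=C`).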