28 May 2024 | Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang
DiG (Diffusion Gated Linear Attention Transformers) is a scalable, efficient diffusion model that leverages Gated Linear Attention (GLA) Transformers for visual generation. It follows the DiT design with minimal parameter overhead while offering superior efficiency and effectiveness: DiG-S/2 trains 2.5× faster than DiT-S/2 and saves 75.7% GPU memory at a resolution of 1792×1792.
DiG scales across a range of computational complexities and outperforms other subquadratic-time diffusion models, including Mamba-based models and DiT with CUDA-optimized FlashAttention-2, making it well suited to large-scale, long-sequence generation tasks. To address the limitations of unidirectional scanning and weak local awareness, DiG incorporates a lightweight spatial reorient & enhancement module (SREM) and uses depth-wise convolution to provide local awareness with minimal added parameters. Evaluated on ImageNet, DiG outperforms DiT while being more efficient in training speed and GPU memory, positioning it as a promising next-generation backbone for diffusion models. The paper presents an extensive analysis of DiG's performance, efficiency, and scalability in visual generation tasks.
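To make the efficiency claim concrete, the core idea behind gated linear attention can be sketched as a recurrence: a matrix-valued state is decayed by a learned gate and updated with a rank-1 outer product at each step, so the cost is linear in sequence length instead of quadratic. The sketch below is a generic single-head GLA recurrence in NumPy, not the paper's actual implementation; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def gla_recurrence(q, k, v, alpha):
    """Minimal gated linear attention recurrence (single head, illustrative).

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    alpha: (T, d_k) per-key decay gates in (0, 1).
    The state S is a (d_k, d_v) matrix updated in O(d_k * d_v) per step,
    so the whole scan is linear in sequence length T, unlike the
    quadratic cost of softmax attention.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        # Gate decays each row of the state, then a rank-1 outer-product
        # update writes the current key/value pair into the state.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

With all gates fixed at 1 this reduces to plain (ungated) linear attention, i.e. each output equals the query applied to the running sum of key-value outer products.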
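The summary also notes that depth-wise convolution supplies local awareness with few parameters. The reason is that a depth-wise convolution applies one small kernel per channel rather than mixing channels, so its parameter count is C·kh·kw instead of C_in·C_out·kh·kw. A minimal NumPy sketch of this operation (illustrative only, with assumed 'same' padding and unit stride):

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Depth-wise 2D convolution: one kernel per channel, 'same' padding.

    x: (C, H, W) feature map; w: (C, kh, kw) per-channel kernels.
    Each channel is filtered independently, so parameters scale as
    C * kh * kw -- far cheaper than a full convolution's
    C_in * C_out * kh * kw, which is why it adds local mixing
    at minimal parameter cost.
    """
    C, H, W = x.shape
    _, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                # Correlate the c-th kernel with the c-th channel only.
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * w[c])
    return out
```

In a deep-learning framework this corresponds to a grouped convolution with as many groups as channels (e.g. PyTorch's `nn.Conv2d` with `groups=C`).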