Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

The paper introduces Diffusion Mamba (DiM), an architecture that combines the efficiency of the Mamba state space model with the generative capabilities of diffusion processes. DiM addresses the computational cost of diffusion transformers (DiT) by replacing attention with bidirectional state space models (SSMs), which gives linear complexity with respect to sequence length. The reduced computational overhead makes it suitable for generating high-resolution images and videos. The paper reviews existing work on state space models, diffusion models, and diffusion transformers, and details the development and implementation of DiM. Extensive experiments on datasets such as ImageNet and UCF-101 demonstrate DiM's superior performance and efficiency, setting new benchmarks for image and video generation. The paper concludes by highlighting potential applications of DiM, such as media production and education, while acknowledging limitations and areas for further exploration.
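To make the linear-complexity claim concrete, here is a minimal sketch of a bidirectional state space scan in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `ssm_scan` and `bidirectional_ssm` are hypothetical, the parameters `a`, `b`, `c` are fixed scalars (Mamba itself uses input-dependent, selective parameters), and summing the forward and backward passes is one simple way to combine them that DiM may not use.

```python
def ssm_scan(xs, a, b, c):
    """Single-pass linear SSM over a sequence of scalars.

    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    Each token costs a constant amount of work, so a sequence of
    length L costs O(L), unlike the O(L^2) pairwise cost of attention.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_ssm(xs, a, b, c):
    """Combine a forward and a backward scan (assumed here: by summation).

    The forward scan only sees past tokens; scanning the reversed
    sequence and flipping the result lets each position also see
    the tokens that come after it.
    """
    fwd = ssm_scan(xs, a, b, c)
    bwd = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


if __name__ == "__main__":
    # An impulse at the last position: the forward scan leaves earlier
    # positions at zero, while the bidirectional output does not.
    print(ssm_scan([0.0, 0.0, 1.0], 0.5, 1.0, 1.0))
    print(bidirectional_ssm([0.0, 0.0, 1.0], 0.5, 1.0, 1.0))
```

The two scans together still cost O(L), which is why a bidirectional design keeps the linear scaling while giving every image or video token context from both directions.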

24 May 2024 | Shentong Mo, Yapeng Tian