10 Jul 2024 | Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
**Diffusion Mamba (DiM)** is a novel diffusion model that combines the efficiency of Mamba, a sequence model based on State Space Models (SSMs), with the expressive power of diffusion models for efficient high-resolution image synthesis. The primary challenge it addresses is the mismatch between Mamba's causal, one-dimensional sequential modeling and the 2D spatial structure of images. To overcome this, DiM introduces multi-directional scans, learnable padding tokens, and lightweight local feature enhancement. These designs let DiM capture spatial structure effectively while retaining Mamba's complexity that is linear in sequence length at inference.
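The multi-directional scans are the key adaptation to 2D data: the patch grid is flattened into several complementary 1D orders, so that across the scans every token sees context from every direction despite the causal SSM. Below is a minimal sketch of the idea in PyTorch, assuming patch tokens arranged as a `(B, H, W, C)` tensor; the function name and the four directions shown (row-major, column-major, and their reversals) are illustrative, not the paper's exact design.

```python
import torch

def multi_directional_scans(tokens: torch.Tensor) -> list[torch.Tensor]:
    """Flatten a (B, H, W, C) grid of patch tokens into four 1D scan orders.

    Each order gives a causal SSM a different view of the 2D grid:
    row-major, reversed row-major, column-major, reversed column-major.
    """
    b, h, w, c = tokens.shape
    row_major = tokens.reshape(b, h * w, c)                  # left-to-right, top-to-bottom
    col_major = tokens.transpose(1, 2).reshape(b, h * w, c)  # top-to-bottom, left-to-right
    return [
        row_major,
        row_major.flip(dims=[1]),  # reversed row-major
        col_major,
        col_major.flip(dims=[1]),  # reversed column-major
    ]

# Example: a 16x16 grid of 384-dim patch tokens.
x = torch.randn(2, 16, 16, 384)
scans = multi_directional_scans(x)
print([s.shape for s in scans])  # four (2, 256, 384) sequences
```

Each flattened sequence would be processed by its own (or a shared) Mamba block and the outputs merged, which is the common pattern in vision SSMs.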
To improve training efficiency for high-resolution image generation, DiM employs a "weak-to-strong" training strategy: the model is pre-trained on low-resolution images (256×256) and then fine-tuned at higher resolution (512×512). Additionally, training-free upsampling techniques extend generation to even higher resolutions (1024×1024 and 1536×1536) with no additional fine-tuning.
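In code, the weak-to-strong strategy reduces to a two-stage schedule over input resolutions with shared weights. The sketch below uses a toy fully convolutional denoiser as a stand-in for DiM's Mamba backbone (any resolution-agnostic network plays the same role) and a standard epsilon-prediction diffusion loss; the model, noise schedule, batch sizes, and step counts are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Standard linear DDPM noise schedule (illustrative hyperparameters).
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class ToyDenoiser(nn.Module):
    # Fully convolutional, so the same weights accept any input resolution,
    # mirroring how a sequence backbone accepts any token count.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, t):
        return self.net(x_t)  # predicts the added noise (t-conditioning omitted)

def diffusion_step(model, opt, x0):
    t = torch.randint(0, T, (x0.size(0),))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion
    loss = nn.functional.mse_loss(model(x_t, t), noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

model = ToyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Stage 1 ("weak"): pre-train on low-resolution 256x256 batches.
for _ in range(2):
    diffusion_step(model, opt, torch.randn(2, 3, 256, 256))
# Stage 2 ("strong"): fine-tune the same weights on 512x512 batches,
# which is affordable because cost grows only linearly with token count.
for _ in range(2):
    diffusion_step(model, opt, torch.randn(1, 3, 512, 512))
```

The design choice is that all learned parameters transfer unchanged between stages; only the data resolution changes, so most of the compute is spent at the cheap low resolution.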
Experiments on ImageNet and CIFAR-10 demonstrate DiM's effectiveness and efficiency. On ImageNet, DiM-Huge performs comparably to other transformer-based and SSM-based diffusion models despite being trained for fewer iterations, and training-free upsampling further extends its ability to generate high-resolution images.
The paper also includes ablation studies that validate the contributions of its architectural designs, and concludes with a discussion of limitations and the broader impacts of image generation models.