DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

10 Jul 2024 | Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
Diffusion Mamba (DiM) is a diffusion model that combines the efficiency of Mamba, a sequence model based on State Space Models (SSMs), with the expressive power of diffusion models for high-resolution image synthesis. Because Mamba does not directly generalize to 2D signals, DiM introduces multi-directional scans, learnable padding tokens, and lightweight local feature enhancement. The resulting architecture is efficient at inference time for high-resolution images. Training follows a "weak-to-strong" strategy: the model is pretrained on low-resolution images and then fine-tuned on high-resolution images. In addition, training-free upsampling strategies let the model generate images at higher resolutions without further fine-tuning. Experiments on CIFAR-10 and ImageNet show that DiM performs comparably to other transformer-based and SSM-based diffusion models while training efficiently. DiM generates high-quality images at 1024 × 1024 and 1536 × 1536, reaching resolutions up to 1536 × 1536 without further fine-tuning.
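To make the idea of multi-directional scans with padding tokens more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes four scan directions (row-major, column-major, and their reverses) and assumes one padding token is appended at the end of every scanned row or column; the summary above does not specify either detail. The function name `scan_with_padding` and the fixed `pad` vector are hypothetical; in a real model the padding token would be a learned parameter.

```python
import numpy as np

DIRECTIONS = ("row", "row_rev", "col", "col_rev")  # assumed set of scan orders


def scan_with_padding(tokens, pad, direction):
    """Flatten an (H, W, C) grid of patch tokens into a 1D sequence.

    A padding token is appended after every scanned line (row or column),
    so a 1D sequence model can tell where one line ends and the next begins.
    """
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    grid = tokens
    if direction.startswith("col"):
        grid = grid.transpose(1, 0, 2)  # treat columns as the scanned lines
    if direction.endswith("_rev"):
        grid = grid[::-1, ::-1]  # visit lines, and tokens in each line, backwards
    lines, _, channels = grid.shape
    # One copy of the padding token per line, placed at the end of that line.
    pad_column = np.broadcast_to(pad, (lines, 1, channels))
    padded = np.concatenate([grid, pad_column], axis=1)
    return padded.reshape(-1, channels)


if __name__ == "__main__":
    # Toy 2 x 3 grid with one channel; -1 stands in for the padding token.
    tokens = np.arange(6).reshape(2, 3, 1)
    pad = np.array([-1])
    for d in DIRECTIONS:
        print(d, scan_with_padding(tokens, pad, d).ravel().tolist())
```

Each direction yields a sequence of length (W + 1) · H for row scans or (H + 1) · W for column scans, so the padding overhead stays small relative to the number of image tokens. Alternating such orders across layers is one plausible way to give a 1D sequence model access to spatial context in several directions.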