This paper introduces Diffusion State Space Models (DiS), a novel diffusion model architecture that replaces the traditional U-Net backbone with a state space backbone. DiS treats all inputs, including the timestep, condition, and noisy image patches, as discrete tokens and processes them with a bidirectional Mamba architecture. At comparable model sizes, DiS matches or surpasses CNN-based and Transformer-based U-Net architectures on both unconditional and class-conditional image generation. DiS also scales well: models with higher Gflops consistently achieve lower FID scores. Evaluated on CIFAR-10, CelebA, and ImageNet, DiS-H/2 models operating in latent space reach performance on par with prior diffusion models on the ImageNet 256×256 and 512×512 benchmarks while substantially reducing computational cost, and achieve state-of-the-art FID scores for class-conditional ImageNet generation. These results suggest the architecture is efficient, scalable, and well suited to large-scale cross-modal datasets. Code and models are available at https://github.com/feizc/DiS.
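
To make the token-sequence design concrete, below is a minimal sketch of a DiS-style backbone, assuming the `mamba_ssm` package's `Mamba` block. The module names (`BiMambaBlock`, `DiSSketch`), the choice to sum a forward pass with a reversed pass to get bidirectionality, and the way the time/class condition is prepended as a single token are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # requires the mamba-ssm package (CUDA only)


def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of diffusion timesteps; t has shape (B,)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (B, dim)


class BiMambaBlock(nn.Module):
    """Bidirectional scan (assumed variant): run one Mamba over the token
    sequence and another over its reverse, then sum, so every token sees
    context from both directions despite the SSM's causal recurrence."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim)
        self.bwd = Mamba(d_model=dim)

    def forward(self, x):  # x: (B, L, D)
        h = self.norm(x)
        return x + self.fwd(h) + self.bwd(h.flip(1)).flip(1)


class DiSSketch(nn.Module):
    """Hypothetical DiS-style backbone: patchify the noisy (latent) image,
    prepend time + class condition as one token, process with bidirectional
    Mamba blocks, and predict per-patch noise."""

    def __init__(self, img_size=32, patch=2, in_ch=4, dim=384,
                 depth=12, num_classes=1000, t_dim=256):
        super().__init__()
        self.t_dim = t_dim
        self.patch_embed = nn.Conv2d(in_ch, dim, patch, stride=patch)
        self.time_mlp = nn.Sequential(nn.Linear(t_dim, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))
        self.label_embed = nn.Embedding(num_classes + 1, dim)  # +1 null class
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.blocks = nn.ModuleList(BiMambaBlock(dim) for _ in range(depth))
        self.out = nn.Linear(dim, patch * patch * in_ch)

    def forward(self, x, t, y):
        # Image patches -> token sequence: (B, C, H, W) -> (B, L, D)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        cond = self.time_mlp(timestep_embedding(t, self.t_dim))
        cond = (cond + self.label_embed(y)).unsqueeze(1)      # (B, 1, D)
        seq = torch.cat([cond, tokens], dim=1)                # condition token first
        for blk in self.blocks:
            seq = blk(seq)
        # Drop the condition token; unpatchifying back to image shape is omitted.
        return self.out(seq[:, 1:])                           # (B, L, patch*patch*C)
```

Under these assumptions the backbone is a drop-in replacement for a U-Net noise predictor: the sequence length grows linearly with the number of patches, and the Mamba scans keep per-block cost linear in that length, which is the source of the reduced computational burden the paper reports relative to attention-based backbones.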