This paper introduces Diffusion State Space Models (DiS), a novel diffusion model architecture that replaces the traditional U-Net backbone with a state space backbone. DiS treats all inputs, including the timestep, condition, and noisy image patches, as discrete tokens and processes them with a bidirectional Mamba architecture. At comparable model sizes, DiS matches or surpasses CNN-based and Transformer-based U-Net architectures on both unconditional and class-conditional image generation. DiS also scales well: models with higher Gflops consistently achieve lower FID scores. Evaluated on CIFAR-10, CelebA, and ImageNet, DiS-H/2 models operating in latent space reach performance on par with prior diffusion models on the ImageNet 256×256 and 512×512 benchmarks while substantially reducing computational cost, and achieve state-of-the-art FID scores for class-conditional ImageNet generation. These results suggest the architecture is efficient, scalable, and well suited to large-scale cross-modal datasets. Code and models are available at https://github.com/feizc/DiS.
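
To make the token-sequence design concrete, below is a minimal sketch of a DiS-style backbone, assuming the `mamba_ssm` package's `Mamba` block. The module names (`BiMambaBlock`, `DiSSketch`), the choice to sum a forward pass with a reversed pass to get bidirectionality, and the way the time/class condition is prepended as a single token are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # requires the mamba-ssm package (CUDA only)


def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of diffusion timesteps; t has shape (B,)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (B, dim)


class BiMambaBlock(nn.Module):
    """Bidirectional scan (assumed variant): run one Mamba over the token
    sequence and another over its reverse, then sum, so every token sees
    context from both directions despite the SSM's causal recurrence."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim)
        self.bwd = Mamba(d_model=dim)

    def forward(self, x):  # x: (B, L, D)
        h = self.norm(x)
        return x + self.fwd(h) + self.bwd(h.flip(1)).flip(1)


class DiSSketch(nn.Module):
    """Hypothetical DiS-style backbone: patchify the noisy (latent) image,
    prepend time + class condition as one token, process with bidirectional
    Mamba blocks, and predict per-patch noise."""

    def __init__(self, img_size=32, patch=2, in_ch=4, dim=384,
                 depth=12, num_classes=1000, t_dim=256):
        super().__init__()
        self.t_dim = t_dim
        self.patch_embed = nn.Conv2d(in_ch, dim, patch, stride=patch)
        self.time_mlp = nn.Sequential(nn.Linear(t_dim, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))
        self.label_embed = nn.Embedding(num_classes + 1, dim)  # +1 null class
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.blocks = nn.ModuleList(BiMambaBlock(dim) for _ in range(depth))
        self.out = nn.Linear(dim, patch * patch * in_ch)

    def forward(self, x, t, y):
        # Image patches -> token sequence: (B, C, H, W) -> (B, L, D)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        cond = self.time_mlp(timestep_embedding(t, self.t_dim))
        cond = (cond + self.label_embed(y)).unsqueeze(1)      # (B, 1, D)
        seq = torch.cat([cond, tokens], dim=1)                # condition token first
        for blk in self.blocks:
            seq = blk(seq)
        # Drop the condition token; unpatchifying back to image shape is omitted.
        return self.out(seq[:, 1:])                           # (B, L, patch*patch*C)
```

Under these assumptions the backbone is a drop-in replacement for a U-Net noise predictor: the sequence length grows linearly with the number of patches, and the Mamba scans keep per-block cost linear in that length, which is the source of the reduced computational burden the paper reports relative to attention-based backbones.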