This paper introduces Diffusion Mamba (DiM), a novel architecture that integrates the computational efficiency of the Mamba state space model with the generative capabilities of diffusion processes. DiM replaces traditional attention mechanisms with a scalable alternative that achieves linear complexity with respect to sequence length, significantly reducing computational load. The architecture is designed for efficient image and video generation and outperforms existing diffusion transformers on both tasks. Its efficiency is demonstrated through extensive experiments on the ImageNet and UCF-101 datasets, showing superior performance and a lower computational footprint than state-of-the-art models such as DiT. Scalability is further validated by DiM's ability to handle high-resolution images and videos with reduced computational demands, while its bidirectional state space models effectively capture spatial and temporal dependencies, making the architecture suitable for both image and video generation. Ablation studies show that DiM's efficiency is maintained across model sizes and patch dimensions, with significant improvements in processing time and resource utilization. The results confirm that DiM produces high-fidelity and diverse outputs, setting a new benchmark for efficient, scalable image and video generation. Limitations include the need for further testing on diverse video types and the potential for improved long-term dependency handling in extended video sequences. The DiM architecture has broad implications for media and entertainment, education, and other fields requiring high-quality visual content generation.
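To make the core idea concrete (replacing the quadratic attention sub-layer of a diffusion transformer with a linear-time bidirectional scan over patch tokens), the following is a minimal sketch, not the authors' implementation. The `LinearRecurrentMixer` class is a simplified stand-in for Mamba's selective state space scan, and all module names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of a DiM-style block (not the paper's code): the
# self-attention sub-layer of a diffusion transformer is replaced by a
# bidirectional sequence mixer so cost grows linearly in sequence length.
# A simple gated linear recurrence stands in for Mamba's selective SSM.
import torch
import torch.nn as nn


class LinearRecurrentMixer(nn.Module):
    """Causal linear recurrence h_t = a_t * h_{t-1} + b_t * x_t (stand-in for an SSM scan)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        a, b = torch.sigmoid(self.gate(x)).chunk(2, dim=-1)
        h = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):  # O(seq_len) scan instead of O(seq_len^2) attention
            h = a[:, t] * h + b[:, t] * x[:, t]
            outputs.append(h)
        return self.out(torch.stack(outputs, dim=1))


class BidirectionalDiMBlock(nn.Module):
    """One block: bidirectional scan over patch tokens plus an MLP, both with residuals."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.fwd = LinearRecurrentMixer(dim)
        self.bwd = LinearRecurrentMixer(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, num_patches, dim)
        h = self.norm1(x)
        # Scan the token sequence in both directions so every patch sees both
        # preceding and following spatial context, then merge the two passes.
        x = x + self.fwd(h) + self.bwd(h.flip(1)).flip(1)
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    block = BidirectionalDiMBlock(dim=64)
    tokens = torch.randn(2, 256, 64)  # e.g. a 16x16 grid of image patch tokens
    print(block(tokens).shape)        # torch.Size([2, 256, 64])
```

In an actual DiM-style model, the Python loop above would be replaced by a hardware-efficient parallel scan (as in Mamba), and timestep or class conditioning would modulate the normalization layers; the sketch only illustrates why a bidirectional recurrence scales linearly where attention scales quadratically.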