Dimba is a text-to-image diffusion model built on a hybrid architecture that combines Transformer and Mamba layers. Stacked blocks alternate between the two layer types, and conditional information from the text prompt is injected through cross-attention. This hybrid design lets Dimba draw on the strengths of both architectures, and it is flexible enough to be adapted to a range of resource constraints and objectives. When scaled appropriately, Dimba offers higher throughput and a smaller memory footprint than conventional Transformer-based diffusion models.
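To make the layer pattern concrete, below is a minimal PyTorch sketch of how such a hybrid backbone could be wired up. This is illustrative, not the paper's implementation: names such as `HybridBlock`, `DimbaBackbone`, and `MambaStandIn` are invented, the strict even/odd alternation and the dimensions are placeholders, and `MambaStandIn` is a simple gated-convolution stand-in for a real selective state-space (Mamba) layer.

```python
# Illustrative sketch of a Dimba-style hybrid backbone; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaStandIn(nn.Module):
    """Placeholder sequence mixer standing in for a real Mamba (selective
    state-space) layer: a gated causal depthwise convolution."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        # Depthwise conv over the sequence axis, trimmed back to length L.
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(F.silu(gate) * h)


class HybridBlock(nn.Module):
    """One stacked block: a token mixer (self-attention or the Mamba
    stand-in), cross-attention over text-condition tokens, and an MLP."""

    def __init__(self, dim: int, n_heads: int, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(dim)
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        else:
            self.mixer = MambaStandIn(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        h = self.mixer(h, h, h)[0] if self.use_attention else self.mixer(h)
        x = x + h                                  # token mixing
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]  # inject text condition
        x = x + self.mlp(self.norm3(x))            # channel mixing
        return x


class DimbaBackbone(nn.Module):
    """Alternates Transformer and Mamba-style blocks through the stack."""

    def __init__(self, dim: int = 512, depth: int = 8, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [HybridBlock(dim, n_heads, use_attention=(i % 2 == 0))
             for i in range(depth)])

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x, cond)
        return x


tokens = torch.randn(2, 256, 512)  # latent image patch tokens
text = torch.randn(2, 77, 512)     # text-encoder embeddings
out = DimbaBackbone()(tokens, text)
print(out.shape)                   # torch.Size([2, 256, 512])
```

In the actual model, the ratio of attention to Mamba layers and the exact placement of cross-attention are design choices; the sketch simply shows the stacking-and-alternating pattern the paper describes.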
Extensive experiments show that Dimba performs on par with strong benchmarks in image quality, artistic rendering, and semantic control, and the architecture exhibits intriguing properties; checkpoints are released for further research. The paper gives a detailed account of the model's design and training strategy, together with comparisons against other text-to-image models. Dimba is trained on a large-scale dataset of high-quality image-text pairs, with a training process that includes quality-tuning and resolution-adaptation stages.

The paper also discusses the model's limitations, including potential biases in the training data and difficulty in generating certain styles and scenes. Overall, Dimba balances performance against memory requirements while maintaining high throughput, and its results highlight the promise of large-scale hybrid attention-Mamba backbones for text-to-image diffusion, pointing to a fruitful direction for future research.