This paper introduces Dimba, a novel text-to-image diffusion model with a hybrid architecture that combines Transformer and Mamba layers. Dimba alternates Transformer and Mamba layers in a stacked manner, injecting conditional information through cross-attention layers. The design balances throughput, memory usage, and image quality, offering flexibility under varying resource constraints. Extensive experiments show that Dimba achieves performance comparable to existing benchmarks in image quality, artistic rendering, and semantic control. Key contributions include the Dimba architecture itself, a large-scale high-quality image-text dataset, and a staged progressive training strategy. The paper also explores optimization techniques such as quality tuning and resolution adaptation, highlighting the promise of hybrid Transformer-Mamba architectures for text-to-image generation.
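The alternating layout described above can be sketched structurally. This is a minimal illustration of the interleaving pattern only; the function name, the strict even/odd alternation, and the placement of a cross-attention layer after every block are assumptions for illustration, not Dimba's exact published configuration.

```python
def build_hybrid_stack(depth: int) -> list[str]:
    """Sketch of a stacked hybrid backbone: Transformer and Mamba
    blocks alternate, and a cross-attention layer after each block
    injects the text-conditioning signal.

    NOTE: illustrative only -- the real model's layer ratio and
    conditioning placement may differ.
    """
    layers = []
    for i in range(depth):
        # Alternate the sequence-mixing backbone at each position.
        backbone = "transformer" if i % 2 == 0 else "mamba"
        layers.append(backbone)
        # Conditional (text) information enters via cross-attention.
        layers.append("cross_attention")
    return layers

stack = build_hybrid_stack(4)
```

A `depth` of 4 yields a transformer/mamba/transformer/mamba backbone with a cross-attention layer interleaved after each block; trading the ratio of Transformer to Mamba blocks is what lets such designs tune the throughput/memory/quality balance the abstract mentions.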