The paper introduces PIXART-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. It builds on the pre-trained PIXART-α model and strengthens it through "weak-to-strong training," which incorporates higher-quality training data and efficient token compression. With only 0.6B parameters, PIXART-Σ is considerably smaller than existing text-to-image (T2I) models such as SDXL (2.6B) and SD Cascade (5.1B), and it is trained at low cost. The paper reports that it nonetheless outperforms these models in image quality and in adherence to user prompts, producing high-fidelity images that follow detailed text closely.

The paper also covers the technical details of the model's design: the efficient DiT architecture, key-value (KV) token compression, and the training strategies used to adapt the model to a new VAE, to higher resolutions, and to KV compression. Experimental results show that PIXART-Σ is competitive with state-of-the-art T2I models in aesthetic quality and text-image alignment while using few computational resources. Its ability to generate 4K images supports high-resolution content creation in industries such as film and gaming.
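
To make the idea of KV token compression concrete, the following is a minimal sketch and not the paper's implementation. It assumes that keys and values laid out on a 2D token grid are shrunk by averaging each `ratio` x `ratio` window before attention, while queries keep their full resolution. The function names, the `ratio` parameter, and the use of average pooling are all illustrative assumptions; the operator used in the paper may differ.

```python
import numpy as np


def compress_kv(tokens, height, width, ratio=2):
    """Average-pool a row-major (height*width, dim) token grid over
    ratio x ratio spatial windows (hypothetical compression operator)."""
    assert height % ratio == 0 and width % ratio == 0
    dim = tokens.shape[-1]
    grid = tokens.reshape(height // ratio, ratio, width // ratio, ratio, dim)
    # Average inside each window, then flatten back to a token sequence.
    return grid.mean(axis=(1, 3)).reshape(-1, dim)


def attention_with_kv_compression(q, k, v, height, width, ratio=2):
    """Single-head attention: full-resolution queries attend to
    compressed keys and values."""
    k_c = compress_kv(k, height, width, ratio)
    v_c = compress_kv(v, height, width, ratio)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the compressed key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, d = 4, 6, 8
    q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
    out = attention_with_kv_compression(q, k, v, h, w, ratio=2)
    print(out.shape)
```

Under these assumptions, the output keeps one row per query token, while the attention score matrix shrinks by a factor of `ratio**2` along the key axis. That reduction is where the savings at high resolution would come from.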