17 Mar 2024 | Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li
**PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation**
This paper introduces PixArt-Σ, a Diffusion Transformer (DiT) model capable of generating 4K resolution images directly. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering higher fidelity and improved alignment with text prompts. The key features of PixArt-Σ include:
1. **High-Quality Training Data**: PixArt-Σ incorporates superior-quality image data, including 33M high-resolution images (over 1K resolution) and more precise, detailed image captions.
2. **Efficient Token Compression**: A novel attention module within the DiT framework compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation.
3. **Weak-to-Strong Training Strategy**: The model evolves from a 'weaker' baseline to a 'stronger' model through efficient training, leveraging higher-quality data and efficient token compression.
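The token-compression idea in item 2 can be sketched in code. The paper compresses keys and values with a learned convolutional operator inside self-attention; the sketch below substitutes simple average pooling along the token axis as a stand-in, so the function name, pooling choice, and shapes are illustrative assumptions, not the paper's implementation. The point it demonstrates is the efficiency win: attention cost drops from N×N to N×(N/ratio).

```python
# Hedged sketch of key-value token compression in self-attention.
# PixArt-Sigma uses a learned conv-based compressor; here average
# pooling stands in, purely for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_compressed_attention(q, k, v, ratio=2):
    """Scaled dot-product attention where keys and values are
    average-pooled along the token axis by `ratio` first.
    q, k, v: (N, d) arrays. Returns an (N, d) array."""
    n, d = k.shape
    m = n // ratio
    # Compress: pool each consecutive group of `ratio` tokens into one.
    k_c = k[: m * ratio].reshape(m, ratio, d).mean(axis=1)
    v_c = v[: m * ratio].reshape(m, ratio, d).mean(axis=1)
    # Score matrix is now (N, m) rather than (N, N).
    scores = q @ k_c.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v_c

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out = kv_compressed_attention(q, k, v, ratio=2)
print(out.shape)  # (8, 4): output token count is unchanged
```

Note that only keys and values are compressed; the queries keep their full length, so the output still has one token per input token and spatial resolution is preserved.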
These advances yield superior image quality and prompt adherence from a significantly smaller model (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). PixArt-Σ's ability to generate 4K images directly supports the creation of high-resolution posters and wallpapers, aiding visual content production in industries such as film and gaming.
The paper also discusses related work, including diffusion transformers, high-resolution image generation, and efficient Transformer architectures. It provides a detailed methodology, experimental setup, and performance comparisons, demonstrating PixArt-Σ's effectiveness in generating high-quality, photo-realistic images with intricate details and accurate alignment with textual prompts.