10 Jan 2024 | Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li
This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough of 0.5 seconds for generating 1024×1024 pixel images, marking a 7× improvement over PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability, PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.
In this technical report, we propose PIXART-δ, which incorporates LCM and ControlNet into PIXART-α. Notably, PIXART-α is an advanced, high-quality 1024px diffusion transformer text-to-image synthesis model developed by our team, known for its superior image generation quality achieved through an exceptionally efficient training process. We incorporate LCM into PIXART-δ to accelerate inference. By viewing the reverse diffusion process as solving an augmented probability flow ODE, LCM enables high-quality, fast inference with only 2-4 steps on pre-trained LDMs, allowing PIXART-δ to generate samples in roughly 4 steps while preserving high generation quality. As a result, PIXART-δ takes 0.5 seconds per 1024×1024 image on an A100 GPU, improving inference speed 7× over PIXART-α. We also support LCM-LoRA for a better user experience and convenience.
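The few-step sampling described above can be sketched schematically. A consistency function f_θ maps any noisy sample x_t directly to the clean endpoint x_0 of its PF-ODE trajectory; multistep sampling alternates this one-jump denoising with re-noising to a smaller timestep. The toy loop below illustrates only this control flow, not the PIXART-δ implementation; `f_theta`, the timestep list, and the noise schedule are all hypothetical placeholders.

```python
import numpy as np

def lcm_multistep_sample(f_theta, x_T, timesteps, alphas_cumprod, rng):
    """Toy multistep consistency sampling loop (a schematic sketch).

    f_theta(x_t, t) stands in for a learned consistency function that maps
    a noisy sample x_t straight to a clean estimate x_0; between steps we
    forward-diffuse that estimate to the next (smaller) timestep, so each
    additional step refines the sample.
    """
    x = x_T
    for i, t in enumerate(timesteps):            # e.g. 4 descending timesteps
        x0 = f_theta(x, t)                       # jump directly to a clean estimate
        if i < len(timesteps) - 1:
            t_next = timesteps[i + 1]
            a = alphas_cumprod[t_next]
            # re-noise the estimate to the next, less noisy timestep
            x = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
        else:
            x = x0                               # final step: return the clean estimate
    return x
```

With a perfect (oracle) consistency function the loop recovers the clean sample exactly after the last step, which is what makes inference in roughly 4 steps possible.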
In addition, we incorporate a ControlNet-like module into PIXART-δ. ControlNet demonstrates superior control over the outputs of text-to-image diffusion models under various conditions. However, the model architecture of ControlNet is intricately designed for UNet-based diffusion models, and we observe that directly replicating it in a Transformer model proves less effective. Consequently, we propose a novel ControlNet-Transformer architecture customized for the Transformer model. Our ControlNet-Transformer achieves explicit controllability alongside high-quality image generation.
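The defining ControlNet trick that any such module inherits is a trainable copy of network blocks whose output is injected through zero-initialized projections, so that at the start of training the conditioned model behaves exactly like the frozen base model. The toy numpy sketch below illustrates only this zero-initialization property; the single-linear-map "block" and all names are hypothetical stand-ins, not the ControlNet-Transformer architecture itself.

```python
import numpy as np

rng = np.random.default_rng(42)

def block(x, W):
    # stand-in for one frozen transformer block (toy: a single linear map)
    return x @ W

d = 8
x = rng.standard_normal((4, d))          # token activations
c = rng.standard_normal((4, d))          # control signal (e.g. an encoded edge map)
W = rng.standard_normal((d, d)) / np.sqrt(d)

# ControlNet recipe: a trainable copy of the block processes the conditioned
# input, and its output is added back through a ZERO-initialized projection.
W_copy = W.copy()                        # trainable copy, initialized from the base
W_zero = np.zeros((d, d))                # zero-initialized connector

base_out = block(x, W)
control_out = base_out + block(x + c, W_copy) @ W_zero

# At initialization the zero projection cancels the control branch, so the
# controlled model reproduces the frozen base model exactly.
assert np.allclose(control_out, base_out)
```

As `W_zero` receives gradient updates during fine-tuning, the control branch gradually steers generation without ever destabilizing the pre-trained base at the start, which is why the scheme trains reliably on top of a frozen model.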