PIXART-δ: FAST AND CONTROLLABLE IMAGE GENERATION WITH LATENT CONSISTENCY MODELS

10 Jan 2024 | Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li
This technical report introduces PIXART-δ, a text-to-image generation framework that integrates Latent Consistency Models (LCM) and ControlNet into the existing PIXART-α model. PIXART-α is known for its efficient training process and high-quality 1024px image generation. The integration of LCM in PIXART-δ significantly accelerates inference, producing high-quality images in just 2-4 sampling steps and taking only 0.5 seconds to generate a 1024 × 1024 image, a 7× improvement over PIXART-α. PIXART-δ can be efficiently trained on 32GB V100 GPUs within a single day, and its support for 8-bit inference allows it to synthesize 1024px images within an 8GB GPU memory budget, greatly broadening its usability. Additionally, the report introduces ControlNet-Transformer, a novel architecture tailored to Transformers that enables fine-grained control over text-to-image diffusion models. ControlNet-Transformer achieves explicit controllability alongside high-quality image generation, making PIXART-δ a promising alternative to the Stable Diffusion family of models. The report includes detailed training algorithms, ablation studies, and experimental results demonstrating the effectiveness of PIXART-δ in terms of speed, quality, and controllability.
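
To make the few-step LCM sampling concrete, the sketch below shows one way to run it with Hugging Face diffusers. The PixArtAlphaPipeline API and the PixArt-alpha/PixArt-LCM-XL-2-1024-MS checkpoint name come from the diffusers ecosystem rather than from this summary, so treat them as assumptions about the released artifacts.

```python
# Minimal sketch of 2-4 step LCM sampling with PIXART-δ (assumes the
# diffusers library and the PixArt-alpha/PixArt-LCM-XL-2-1024-MS checkpoint).
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-LCM-XL-2-1024-MS",  # LCM-distilled PIXART weights
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="A small cactus with a happy face in the Sahara desert",
    num_inference_steps=4,   # LCM needs only 2-4 steps instead of ~20
    guidance_scale=0.0,      # CFG is distilled into the LCM, so disable it
).images[0]
image.save("pixart_lcm_sample.png")
```

This few-step, guidance-free regime is what makes the reported 0.5-second generation of a 1024 × 1024 image possible on high-end hardware.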
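
The 8-bit inference mentioned above chiefly targets the large T5 text encoder, which dominates memory use. A hedged sketch of one common recipe follows: encode the prompt with the T5 loaded in 8 bits via bitsandbytes, free it, then denoise with the transformer alone. The flags used (load_in_8bit, device_map) are standard transformers/diffusers options, not APIs defined by this report.

```python
# Sketch: fit 1024px generation in a small GPU memory budget by running the
# T5 text encoder in 8-bit precision (assumes transformers + bitsandbytes).
import gc
import torch
from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline

ckpt = "PixArt-alpha/PixArt-LCM-XL-2-1024-MS"  # assumed checkpoint name

# Stage 1: encode the prompt with an 8-bit T5, without the transformer loaded.
text_encoder = T5EncoderModel.from_pretrained(
    ckpt, subfolder="text_encoder", load_in_8bit=True, device_map="auto"
)
pipe = PixArtAlphaPipeline.from_pretrained(
    ckpt, text_encoder=text_encoder, transformer=None, device_map="balanced"
)
with torch.no_grad():
    prompt_embeds, prompt_mask, neg_embeds, neg_mask = pipe.encode_prompt(
        "A small cactus with a happy face in the Sahara desert"
    )

# Free the text encoder before loading the diffusion transformer.
del text_encoder, pipe
gc.collect()
torch.cuda.empty_cache()

# Stage 2: denoise with the precomputed embeddings; no text encoder in memory.
pipe = PixArtAlphaPipeline.from_pretrained(
    ckpt, text_encoder=None, torch_dtype=torch.float16
).to("cuda")
image = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    prompt_attention_mask=prompt_mask,
    negative_prompt_embeds=neg_embeds,
    negative_prompt_attention_mask=neg_mask,
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("pixart_8bit_sample.png")
```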
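
Finally, the core idea behind ControlNet-Transformer can be illustrated with the standard ControlNet pattern adapted to Transformer blocks: a trainable copy of a base block whose output passes through a zero-initialized linear layer (replacing the zero convolution of the original UNet-based ControlNet) before being added back to the frozen path. The sketch below only shows this injection pattern with a stand-in block; the exact wiring of the copied blocks is specified in the full report, so the details here are illustrative assumptions.

```python
# Sketch of the zero-initialized injection pattern in ControlNet-Transformer.
import copy
import torch
import torch.nn as nn

class ControlBlock(nn.Module):
    """Trainable copy of a base Transformer block plus a zero-initialized
    linear projection, so training starts exactly from the frozen model."""

    def __init__(self, base_block: nn.Module, hidden_dim: int):
        super().__init__()
        self.copy = copy.deepcopy(base_block)        # trainable copy
        self.zero_linear = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_linear.weight)      # contributes nothing
        nn.init.zeros_(self.zero_linear.bias)        # at initialization

    def forward(self, hidden_states: torch.Tensor, control: torch.Tensor):
        # Process the base features plus the encoded control signal, then
        # project through the zero-initialized linear to form a residual.
        return self.zero_linear(self.copy(hidden_states + control))

# Toy usage with a stand-in block (the real base block is a PIXART DiT block).
dim = 64
base = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
ctrl = ControlBlock(base, hidden_dim=dim)
x = torch.randn(2, 16, dim)   # (batch, tokens, dim) image tokens
c = torch.randn(2, 16, dim)   # encoded control condition (e.g. edge maps)
next_input = base(x) + ctrl(x, c)   # frozen path plus control residual
```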