8 Mar 2024 | Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang
**Abstract:**
Recent advancements in text-to-image generative systems have been driven by diffusion models, but single-stage models still face challenges in computational efficiency and image detail refinement. To address these issues, the authors propose CogView3, a cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model to implement relay diffusion in text-to-image generation: it creates low-resolution images first and then applies relay-based super-resolution. This approach not only improves the quality of generated images but also significantly reduces training and inference costs. Experimental results show that CogView3 achieves a 77.0% win rate over the current state-of-the-art model, SDXL, in human evaluations while requiring only about half the inference time. The distilled variant of CogView3 achieves comparable performance with only 1/10 of the inference time required by SDXL.
**Keywords:**
Text-to-Image Generation · Diffusion Models
**Introduction:**
Diffusion models have become the mainstream framework in text-to-image generation, offering effective solutions for visual generation tasks. However, single-stage models often struggle with high inference costs and image detail refinement. CogView3 addresses these issues by employing relay diffusion, a cascaded diffusion framework that generates low-resolution images first and then performs super-resolution. This method allows for more efficient and high-quality image generation, especially at high resolutions like 2048 × 2048.
**Method:**
CogView3 is a 3-billion-parameter text-to-image diffusion model with a 3-stage UNet architecture. It operates in the latent image space, using a KL-regularized variational autoencoder to compress images. The model is trained progressively: the base stage generates 512 × 512 images, and the super-resolution stage performs 2× super-resolution to produce 1024 × 1024 images. The super-resolution stage can be applied iteratively to generate even higher resolutions.
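The cascade described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: `base_stage` and `relay_super_resolution` are hypothetical stand-ins that only track resolution, to show how a base pass is chained with repeated 2× relay passes until the target resolution is reached.

```python
# Hypothetical sketch of CogView3's cascaded (relay) generation flow.
# The stage functions are illustrative stubs, not the paper's model code.

def base_stage(prompt: str, resolution: int = 512) -> dict:
    """Generate a low-resolution image from text (stub: records metadata only)."""
    return {"prompt": prompt, "resolution": resolution}

def relay_super_resolution(image: dict) -> dict:
    """Relay diffusion resumes denoising from an upsampled version of the
    previous output rather than from pure noise, doubling the resolution (stub)."""
    return {"prompt": image["prompt"], "resolution": image["resolution"] * 2}

def generate(prompt: str, target_resolution: int = 1024) -> dict:
    image = base_stage(prompt)                 # 512 x 512 base output
    while image["resolution"] < target_resolution:
        image = relay_super_resolution(image)  # one 2x relay pass per loop
    return image

print(generate("a photo of a cat", 2048)["resolution"])  # -> 2048
```

Because each relay pass starts from the previous stage's output instead of pure noise, it needs fewer denoising steps than a full single-stage run at the target resolution, which is the source of the inference-cost savings the summary describes.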
**Experiments:**
CogView3 is evaluated on several benchmarks, including MS-COCO, DrawBench, and PartiPrompts. The results show that CogView3 outperforms SDXL and Stable Cascade in both image quality and human preference. The distilled variant of CogView3, which reduces inference time significantly, also achieves comparable performance.
**Conclusion:**
CogView3 is the first text-to-image generation system to use relay diffusion, achieving high-quality image generation with reduced inference costs. The model's ability to generate extremely high-resolution images and its superior performance in human evaluations make it a significant advancement in the field of text-to-image generation.