CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

8 Mar 2024 | Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang
CogView3 is a text-to-image generation system built on relay diffusion, a cascaded framework that first generates low-resolution images and then applies relay-based super-resolution. This approach reduces both training and inference costs while maintaining high-quality outputs. In human evaluations, CogView3 outperforms SDXL, the current state-of-the-art open-source model, with a 77.0% win rate at about half the inference time. A distilled variant of CogView3 achieves comparable performance with just 1/10 of the inference time of SDXL. The system uses a 3-billion-parameter text-to-image diffusion model with a 3-stage UNet architecture, operating in the latent image space. It first generates images at 512×512 resolution and then performs 2× super-resolution to produce 1024×1024 images. The super-resolution stage can be applied iteratively to reach higher resolutions such as 2048×2048. The relay diffusion framework makes generation efficient by starting the super-resolution diffusion from low-resolution images corrupted with Gaussian noise, which also lets the super-resolution stage correct artifacts from the previous stage. CogView3 also incorporates text expansion to bridge the gap between model training and inference, improving prompt alignment and aesthetic quality, and it uses progressive distillation to reduce inference time while preserving generation quality. Experiments show that CogView3 outperforms SDXL and Stable Cascade in both machine and human evaluations, producing high-quality images at reduced computational cost.
The results highlight the potential of relay diffusion for efficient and high-quality text-to-image generation.
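
The relay step described in the summary (upsample the low-resolution result, corrupt it with Gaussian noise, and hand it to the super-resolution stage) can be illustrated with a toy sketch. This is not the CogView3 implementation: the function names, the nearest-neighbour upsampling, the single-channel list-of-lists image, and the `noise_std` parameter are all assumptions made for illustration, and the real system works on latents with a learned denoiser. The helper `resolution_schedule` only reproduces the 512 → 1024 → 2048 arithmetic of applying a 2× stage repeatedly.

```python
import random


def upsample_2x(image):
    """Nearest-neighbour 2x upsampling of a 2D list of floats (assumed toy format)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))  # duplicate each row
    return out


def relay_start(low_res, noise_std, rng):
    """Hypothetical relay starting point: upsampled image plus Gaussian noise.

    A super-resolution diffusion stage would begin denoising from this
    array; the denoiser itself is out of scope here.
    """
    upsampled = upsample_2x(low_res)
    return [[value + noise_std * rng.gauss(0.0, 1.0) for value in row]
            for row in upsampled]


def resolution_schedule(base, target):
    """Resolutions visited when a 2x stage is applied until `target` is reached."""
    if base <= 0 or target < base:
        raise ValueError("target must be at least base, and base must be positive")
    sizes = [base]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != target:
        raise ValueError("target must be base times a power of two")
    return sizes


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    low_res = [[0.0, 1.0], [2.0, 3.0]]
    start = relay_start(low_res, noise_std=0.5, rng=rng)
    print(len(start), len(start[0]))
    print(resolution_schedule(512, 2048))
```

In this sketch, `noise_std` is a stand-in for the noise level at which the relay begins. The summary does not state which level CogView3 uses.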