SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

4 Jul 2023 | Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
SDXL is a latent diffusion model for text-to-image synthesis that significantly improves upon previous versions of Stable Diffusion. It features a UNet backbone roughly three times larger, with more attention blocks and a larger cross-attention context enabled by a second text encoder. SDXL is trained on multiple aspect ratios and is paired with a refinement model that enhances the visual fidelity of generated images via a post-hoc image-to-image technique.

The improvements include a more powerful pretrained text encoder, micro-conditioning techniques for image size and cropping parameters, and multi-aspect training to handle various image resolutions. An improved autoencoder sharpens local, high-frequency detail in generated images, and a refinement stage further raises quality by applying a noising-denoising process to the base model's output.

User studies show SDXL outperforming all previous versions of Stable Diffusion by a significant margin, and it achieves results competitive with black-box state-of-the-art image generators. However, classical performance metrics such as FID and CLIP score do not fully reflect these improvements, as they are poorly suited to evaluating foundational text-to-image diffusion models. The model is open-sourced to promote research and transparency.
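The micro-conditioning mentioned above embeds scalar metadata (original image size, crop coordinates, target resolution) with the same sinusoidal encoding used for diffusion timesteps, then feeds the concatenated vector into the UNet alongside the timestep embedding. The following is a minimal illustrative sketch of that idea, not SDXL's actual implementation; function names, the embedding dimension, and the exact scalar ordering are assumptions:

```python
import math
import numpy as np

def fourier_embed(value, dim=256, max_period=10000.0):
    """Sinusoidal (Fourier) embedding of one scalar, in the style of
    diffusion timestep embeddings (dim and max_period are assumed)."""
    half = dim // 2
    freqs = np.exp(-math.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def micro_condition(orig_size, crop_top_left, target_size, dim=256):
    """Embed each conditioning scalar separately and concatenate.
    SDXL conditions on (h, w) of the original image, the crop offsets,
    and the target resolution; this sketch mirrors that structure."""
    scalars = [*orig_size, *crop_top_left, *target_size]
    emb = np.concatenate([fourier_embed(s, dim) for s in scalars])
    return emb  # in SDXL this is added to the timestep embedding

vec = micro_condition((512, 768), (0, 0), (1024, 1024))
print(vec.shape)  # 6 scalars x 256 dims = (1536,)
```

At inference time these conditioning scalars become user-controllable knobs, e.g. setting the original-size conditioning to a high resolution steers the model away from low-resolution training artifacts.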
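The refinement stage's noising-denoising process works like SDEdit-style image-to-image: the base model's latent is partially re-noised to some intermediate step, and the refiner then denoises from there. A minimal sketch of the noising half, under an assumed cosine noise schedule (SDXL's exact schedule and parameters may differ):

```python
import math
import numpy as np

def cosine_alpha_bar(t, T=1000):
    """Cumulative signal level alpha_bar(t) for a cosine schedule
    (illustrative choice; the actual schedule is an assumption)."""
    return math.cos(((t / T) + 0.008) / 1.008 * math.pi / 2) ** 2

def noise_to_level(x0, t, T=1000, rng=None):
    """Partially noise a clean latent x0 to step t -- the first half
    of the refiner's noising-denoising (img2img) procedure. The
    refiner would then run its denoising loop from t back to 0."""
    rng = rng or np.random.default_rng(0)
    a = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

latent = np.zeros((4, 128, 128))       # stand-in for a base-model latent
noisy = noise_to_level(latent, t=300)  # moderate noise level
```

Choosing a small t keeps the image's global composition and lets the refiner improve only local, high-frequency detail; a large t would let it change the image more substantially.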