4 Jul 2023
**SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis**
**Authors:** Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
**Institution:** Stability AI, Applied Research
**Abstract:**
SDXL is a latent diffusion model for text-to-image synthesis that significantly improves upon previous versions of Stable Diffusion. It features a three times larger UNet backbone, more attention blocks, and a second text encoder. The model is trained on multiple aspect ratios and uses a refinement model to enhance visual fidelity. User studies show that SDXL outperforms previous versions of Stable Diffusion and competitive black-box image generators. The code and model weights are made publicly available to promote transparency and reproducibility.
**Key Contributions:**
1. **Architecture and Scale:** SDXL uses a 3× larger UNet backbone with more attention blocks and a second text encoder, resulting in 2.6B parameters.
2. **Conditioning Schemes:** Two novel conditioning techniques are introduced: conditioning the model on the original image resolution and on cropping coordinates. Both improve sample quality and reduce cropping artifacts, using only parameters that are freely available during training.
3. **Multi-Aspect Training:** The model is trained on multiple aspect ratios to handle a wider range of input sizes.
4. **Refinement Model:** A separate refinement model improves visual quality by re-noising the base model's latents and denoising them again with a refiner specialized on high-quality, high-resolution data.
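The size/crop conditioning above works by embedding the conditioning scalars with sinusoidal (Fourier) features, the same scheme used for diffusion timestep embeddings, and feeding them to the UNet alongside the timestep. A minimal sketch, assuming illustrative function names and an embedding dimension of 256 (not SDXL's actual code):

```python
import numpy as np

def fourier_embed(values, dim=256):
    """Sinusoidal (Fourier) embedding of scalar conditioning values,
    analogous to the timestep embedding in diffusion UNets."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(values, dtype=float)[..., None] * freqs  # (n, half)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)  # (n, dim)

def micro_conditioning(orig_size, crop_coords, target_size, dim=256):
    """Embed the original (h, w), the crop top-left corner, and the
    target (h, w); the flattened vector would be added to the UNet's
    timestep embedding. Names and layout here are illustrative."""
    vals = [*orig_size, *crop_coords, *target_size]  # 6 scalars
    return fourier_embed(vals, dim).reshape(-1)      # (6 * dim,)

cond = micro_conditioning((512, 512), (0, 0), (1024, 1024))
```

At inference time, passing a large original-size conditioning (e.g. 1024×1024) and a (0, 0) crop steers the model toward uncropped, high-resolution-looking outputs.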
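For multi-aspect training, images are grouped into buckets of varying aspect ratio but roughly constant pixel count, and each batch is drawn from a single bucket. A rough sketch of such bucketing, with parameter values chosen for illustration rather than taken from the paper:

```python
def make_buckets(max_pixels=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (h, w) buckets whose product stays near max_pixels,
    with sides rounded down to a multiple of `step`."""
    buckets = []
    h = min_side
    while h <= max_side:
        w = (max_pixels // h) // step * step
        if min_side <= w <= max_side:
            buckets.append((h, w))
        h += step
    return buckets

def nearest_bucket(h, w, buckets):
    """Assign an image to the bucket with the closest aspect ratio."""
    ratio = h / w
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

buckets = make_buckets()
```

Each training image is resized to its assigned bucket's resolution, so the model sees many aspect ratios while batch shapes stay uniform.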
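The refinement stage follows an SDEdit-style noising-denoising loop: the base model's latents are partially re-noised to an intermediate noise level, then denoised by the refiner. A schematic sketch where `noise_fn` and `refiner_denoise` stand in for the forward-noising process and the refiner UNet step (both hypothetical placeholders, not SDXL's API):

```python
def refine(latents, refiner_denoise, noise_fn, t_start=0.3, steps=15):
    """Re-noise base latents up to time t_start, then run the refiner's
    denoiser over a decreasing time schedule back to t = 0."""
    z = noise_fn(latents, t_start)          # partial forward noising
    for i in range(steps):
        t = t_start * (1 - i / steps)       # linearly decreasing schedule
        z = refiner_denoise(z, t)           # one refiner denoising step
    return z
```

Because the refiner only operates in this low-noise regime, it can specialize on local, high-frequency detail rather than global image layout.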
**Results:**
- **User Studies:** SDXL significantly outperforms previous versions of Stable Diffusion and competitive black-box models in user preference tests.
- **Performance Metrics:** While SDXL shows improved performance in user evaluations, it does not achieve better FID scores compared to previous versions, highlighting the need for alternative evaluation metrics.
**Future Work:**
- A single-stage model of equal or better quality, avoiding the separate base-plus-refiner pipeline.
- Enhancing text synthesis capabilities.
- Exploring transformer-based architectures.
- Reducing inference costs through distillation techniques.
- Investigating continuous-time diffusion models for improved sampling flexibility.
**Limitations:**
- Challenges with complex structures like human hands.
- Lack of perfect photorealism.
- Potential biases in large-scale datasets.
- Concept bleeding in complex scenes.
- Text rendering limitations, particularly for long texts.
**Conclusion:**
SDXL represents a significant advancement in text-to-image synthesis, offering improved performance and visual fidelity. However, further research is needed to address limitations and enhance specific aspects of the model.