Understanding Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

This paper introduces Latent Adversarial Diffusion Distillation (LADD), a distillation approach that addresses the limitations of adversarial diffusion distillation (ADD). LADD uses generative features from pretrained latent diffusion models, which simplifies training, improves performance, and enables high-resolution, multi-aspect-ratio image synthesis. Because the method never decodes back to image space, it significantly reduces memory demands. The authors apply LADD to Stable Diffusion 3 (SD3) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. LADD is compared with other distillation approaches, including Consistency Distillation (LCM), and is found to outperform them in performance and scalability. The paper also presents results on image-to-image tasks, showing that LADD can be applied to instruction-guided image editing and inpainting. The authors highlight the trade-offs between model capacity, prompt alignment, and inference speed, and note that while SD3-Turbo maintains the teacher's image quality in four steps, it does so at some cost to prompt alignment. The paper concludes with a discussion of the limitations of the approach and future research directions.
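
To give a rough sense of why staying in latent space can reduce memory demands, the sketch below compares the number of values in a pixel-space image tensor with the number in its latent counterpart. The 8x spatial downsampling factor and the 16 latent channels are assumptions about the autoencoder, not figures taken from the summary above, and the function names are hypothetical. The count covers a single tensor only; it is not a measurement of actual training memory.

```python
def tensor_elements(height, width, channels):
    """Number of scalar values in one image-like tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=16):
    """Latent (height, width, channels) for an image of the given size.

    downsample and latent_channels are ASSUMED autoencoder settings,
    chosen for illustration only.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // downsample, width // downsample, latent_channels


# Square 1024 x 1024 RGB image versus its assumed latent representation.
pixel_count = tensor_elements(1024, 1024, 3)
latent_count = tensor_elements(*latent_shape(1024, 1024))
ratio = pixel_count / latent_count

# A non-square size works the same way, as long as both sides divide evenly.
wide_latent = latent_shape(640, 1536)

print(f"pixel values:  {pixel_count}")
print(f"latent values: {latent_count}")
print(f"ratio:         {ratio:.1f}x")
print(f"latent shape for 640 x 1536: {wide_latent}")
```

Under these assumptions, a discriminator that reads latents handles a tensor roughly an order of magnitude smaller than one that reads decoded images, and it avoids running the decoder during training.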

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

18 Mar 2024 | Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach