SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

17 Apr 2024 | Yuda Song, Zehao Sun, Xuanwu Yin
The paper introduces SDXS, a real-time one-step latent diffusion model designed for efficient image generation. SDXS addresses the limitations of existing diffusion models, whose complex architectures and high computational demands lead to significant inference latency. The proposed method has two main components: model miniaturization and a reduction in sampling steps. Knowledge distillation is used to streamline the U-Net and image decoder architectures, while a one-step diffusion-model training technique, combining feature matching and score distillation, reduces the number of function evaluations (NFEs) to one. Two models, SDXS-512 and SDXS-1024, are developed, achieving inference speeds of approximately 100 FPS and 30 FPS, respectively, on a single GPU. The training approach also supports efficient image-conditioned control, making it suitable for tasks such as image-to-image translation. The paper provides a comprehensive exploration of the challenges in deploying diffusion models on low-power devices and offers a detailed methodology for achieving real-time inference with high-quality image generation.
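To make the training idea concrete, below is a minimal, hypothetical PyTorch sketch of how a feature-matching loss and a score-distillation-style loss could be combined when distilling a multi-step teacher into a one-step student. The module names (TinyUNet, distillation_losses), the loss weights, and the exact loss forms are illustrative assumptions for this sketch, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyUNet(nn.Module):
    """Toy stand-in for a latent-space denoiser.

    Real SDXS distills Stable Diffusion components; this tiny network only
    serves to illustrate the loss structure (assumption, not the paper's model).
    """

    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(channels, hidden, 3, padding=1)
        self.decoder = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, return_features: bool = False):
        h = F.silu(self.encoder(x))
        out = self.decoder(h)
        return (out, h) if return_features else out


def distillation_losses(student, teacher, latents, lambda_fm=1.0, lambda_sd=1.0):
    """Combine feature matching and a simplified score-distillation loss.

    The weighting and loss forms are assumptions for illustration only.
    """
    with torch.no_grad():
        teacher_out, teacher_feat = teacher(latents, return_features=True)

    student_out, student_feat = student(latents, return_features=True)

    # Feature matching: align intermediate activations of student and teacher.
    loss_fm = F.mse_loss(student_feat, teacher_feat)

    # Score distillation (simplified): match the teacher's predicted output.
    loss_sd = F.mse_loss(student_out, teacher_out)

    return lambda_fm * loss_fm + lambda_sd * loss_sd


if __name__ == "__main__":
    teacher, student = TinyUNet(), TinyUNet()
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    latents = torch.randn(2, 4, 64, 64)  # dummy latent batch
    loss = distillation_losses(student, teacher, latents)
    loss.backward()
    optimizer.step()
    print(f"combined distillation loss: {loss.item():.4f}")
```

In this sketch, the teacher is frozen (no gradients) and the one-step student is trained to match both its intermediate features and its final prediction, which is the general shape of the feature-matching plus score-distillation combination described above.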