Distilling Diffusion Models into Conditional GANs


17 Jul 2024 | Minguk Kang1,2, Richard Zhang2, Connelly Barnes2, Sylvain Paris2, Suha Kwak1, Jaesik Park3, Eli Shechtman2, Jun-Yan Zhu4, and Taesung Park2
The paper "Distilling Diffusion Models into Conditional GANs" proposes a method to transform a complex multi-step diffusion model into a single-step conditional GAN (Diffusion2GAN), significantly accelerating inference while maintaining image quality. The approach treats diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs from the diffusion model's ODE trajectory. To efficiently compute the regression loss, the authors introduce E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing ensemble augmentations. They also adapt the diffusion model to construct a multi-scale discriminator with a text alignment loss, enhancing the conditional GAN formulation. E-LatentLPIPS converges more efficiently than existing distillation methods, even considering dataset construction costs. The method is evaluated on the zero-shot COCO benchmark, outperforming cutting-edge one-step diffusion distillation models like DMD, SDXL-Turbo, and SDXL-Lightning. The paper includes extensive ablation studies and demonstrates the effectiveness of E-LatentLPIPS and the multi-scale diffusion discriminator. The authors also show that Diffusion2GAN can be applied to larger models, such as SDXL, achieving superior FID and CLIP-score compared to one-step SDXL-Turbo and SDXL-Lightning.The paper "Distilling Diffusion Models into Conditional GANs" proposes a method to transform a complex multi-step diffusion model into a single-step conditional GAN (Diffusion2GAN), significantly accelerating inference while maintaining image quality. The approach treats diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs from the diffusion model's ODE trajectory. To efficiently compute the regression loss, the authors introduce E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing ensemble augmentations. 
They also adapt the diffusion model to construct a multi-scale discriminator with a text alignment loss, enhancing the conditional GAN formulation. E-LatentLPIPS converges more efficiently than existing distillation methods, even considering dataset construction costs. The method is evaluated on the zero-shot COCO benchmark, outperforming cutting-edge one-step diffusion distillation models like DMD, SDXL-Turbo, and SDXL-Lightning. The paper includes extensive ablation studies and demonstrates the effectiveness of E-LatentLPIPS and the multi-scale diffusion discriminator. The authors also show that Diffusion2GAN can be applied to larger models, such as SDXL, achieving superior FID and CLIP-score compared to one-step SDXL-Turbo and SDXL-Lightning.
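The core idea behind E-LatentLPIPS, applying the same random augmentation to both latents before measuring a perceptual-style feature distance and averaging over an ensemble, can be illustrated with a minimal numpy sketch. Everything here is a stand-in: the random linear maps in `FEATS` play the role of the paper's latent-space perceptual network, and random flips plus small translations stand in for its augmentation ensemble; none of these names or choices come from the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in "feature extractor": a few fixed random channel-mixing
# maps (the paper instead uses a perceptual network trained on latents).
FEATS = [rng.standard_normal((4, 4)) for _ in range(3)]

def apply_augment(x, flip, shift):
    """Apply one shared geometric augmentation to a latent of shape (C, H, W)."""
    if flip:
        x = x[:, :, ::-1]                 # horizontal flip
    sh, sw = (int(s) for s in shift)      # small circular translation
    return np.roll(x, (sh, sw), axis=(1, 2))

def feat_dist(x, y):
    """LPIPS-style distance: unit-normalize features per spatial location,
    then average the squared differences across the feature maps."""
    d = 0.0
    for W in FEATS:
        fx = np.tensordot(W, x, axes=(1, 0))   # (C_out, H, W)
        fy = np.tensordot(W, y, axes=(1, 0))
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-8)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-8)
        d += np.mean((fx - fy) ** 2)
    return d / len(FEATS)

def e_latent_lpips(x, y, n_ensemble=8, rng=rng):
    """Ensembled latent perceptual distance (sketch of the E-LatentLPIPS idea):
    draw an augmentation, apply it to BOTH latents, and average the distances."""
    total = 0.0
    for _ in range(n_ensemble):
        flip = rng.random() < 0.5
        shift = rng.integers(-2, 3, size=2)
        total += feat_dist(apply_augment(x, flip, shift),
                           apply_augment(y, flip, shift))
    return total / n_ensemble

# Toy latents in an SD-like shape (4 channels): identical inputs give zero
# distance, perturbed inputs give a positive distance.
x = rng.standard_normal((4, 8, 8))
y = x + 0.5 * rng.standard_normal((4, 8, 8))
print(e_latent_lpips(x, x))  # → 0.0
print(e_latent_lpips(x, y) > 0.0)
```

Because the same augmentation parameters are applied to both latents in each draw, the distance stays a valid comparison between the pair; the ensemble averaging is what lets a cheap per-step loss approximate a denser perceptual signal during training.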