Distilling Diffusion Models into Conditional GANs

17 Jul 2024 | Minguk Kang¹², Richard Zhang², Connelly Barnes², Sylvain Paris², Suha Kwak¹, Jaesik Park³, Eli Shechtman², Jun-Yan Zhu⁴, and Taesung Park²
This paper proposes distilling a complex multistep diffusion model into a single-step conditional GAN student, significantly accelerating inference while preserving image quality. The approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs sampled along the diffusion model's ODE trajectory. A perceptual loss, E-LatentLPIPS, operates directly in the diffusion model's latent space and uses an ensemble of augmentations for efficient regression-loss computation, while a multi-scale discriminator is adapted to form an effective conditional GAN formulation. E-LatentLPIPS converges more efficiently than existing distillation losses, and the resulting one-step generator outperforms state-of-the-art one-step diffusion distillation models on the zero-shot COCO benchmark.
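As a rough illustration of this setup, the PyTorch-style sketch below regresses the one-step student onto a noise-to-latent pair taken from the teacher's ODE trajectory, using an augmentation-ensembled perceptual loss computed in latent space. All component names (teacher_ode_solve, latent_lpips, sample_aug) are hypothetical placeholders standing in for the paper's components, not its actual implementation.

```python
import torch

# Hypothetical placeholders assumed by this sketch (not the paper's code):
#   teacher_ode_solve(z, c) -> target latent from the frozen diffusion model's
#                              deterministic ODE sampler
#   generator(z, c)         -> one-step student prediction in latent space
#   latent_lpips(a, b)      -> learned perceptual distance computed directly on latents
#   sample_aug()            -> one randomly parameterized, differentiable augmentation
#                              (e.g., flip / translation / cutout)

def e_latentlpips_step(generator, teacher_ode_solve, latent_lpips, sample_aug,
                       z, prompt_emb):
    """One regression update on a (noise, teacher-output) latent pair.

    Applying a single shared random augmentation per iteration and averaging
    over training approximates an ensemble of augmentations, which is the idea
    behind the paper's E-LatentLPIPS loss.
    """
    with torch.no_grad():
        target = teacher_ode_solve(z, prompt_emb)  # pairs can be precomputed offline
    pred = generator(z, prompt_emb)                # single forward pass of the student
    aug = sample_aug()                             # same augmentation for both inputs
    return latent_lpips(aug(pred), aug(target)).mean()
```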
The method distills a pre-trained diffusion model into a one-step generator by learning a mapping from input noise and text to the diffusion model's output. Treating the task as a paired image-to-image translation problem enables the use of perceptual losses and conditional GANs; with the regression loss alone, the student already performs comparably to guided progressive distillation. On top of this, a multi-scale conditional discriminator is developed that leverages the pre-trained diffusion weights and introduces a single-sample R1 loss and mix-and-match augmentation. The resulting model is named Diffusion2GAN.

Diffusion2GAN distills Stable Diffusion 1.5 into a single-step conditional GAN that outperforms other distillation methods on the zero-shot COCO benchmark, and it also scales to the larger SDXL model, achieving better FID and CLIP score than one-step SDXL-Turbo and SDXL-Lightning. Evaluations across multiple datasets show gains in image realism, text-to-image alignment, and inference speed; Diffusion2GAN additionally converges faster during training than existing distillation methods, generates more diverse images, and is preferred by human raters for both realism and text alignment. The authors also discuss societal impact, noting that the technology can improve the accessibility and affordability of generative visual models.
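To make the two-part objective concrete, here is a hedged sketch of how the latent regression term and a text-conditioned GAN term could be combined. The non-saturating loss form, the lambda_gan weight, and the R1 penalty evaluated on a single real sample are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, latent_reg_loss,
                   z, prompt_emb, target_latent, lambda_gan=0.5):
    """Student objective: regression to the teacher output plus a non-saturating
    conditional GAN term. lambda_gan is an illustrative weight, not the paper's."""
    pred = generator(z, prompt_emb)
    reg = latent_reg_loss(pred, target_latent)     # e.g., the latent perceptual loss above
    fake_logits = discriminator(pred, prompt_emb)  # text-conditioned discriminator
    gan = F.softplus(-fake_logits).mean()          # non-saturating GAN loss
    return reg + lambda_gan * gan

def discriminator_loss(discriminator, pred_latent, target_latent, prompt_emb,
                       r1_gamma=1.0):
    """Conditional discriminator loss with an R1 penalty evaluated on one real
    sample per step (an assumption standing in for the paper's single-sample R1)."""
    real = target_latent.detach().requires_grad_(True)
    real_logits = discriminator(real, prompt_emb)
    fake_logits = discriminator(pred_latent.detach(), prompt_emb)
    adv = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # Gradient penalty on the first sample only, keeping the extra backward pass cheap.
    grad, = torch.autograd.grad(real_logits[:1].sum(), real, create_graph=True)
    r1 = grad[:1].pow(2).flatten(1).sum(1).mean()
    return adv + 0.5 * r1_gamma * r1
```

In practice the discriminator would be the multi-scale, diffusion-initialized network described above; this sketch only fixes the loss plumbing around it.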