27 May 2024 | Sirui Xie, Zhisheng Xiao, Diederik P. Kingma, Tingbo Hou, Ying Nian Wu, Kevin Murphy, Tim Salimans, Ben Poole, Ruiqi Gao
EM Distillation (EMD) is a method for distilling a pre-trained diffusion model into a one-step generator model with minimal loss of perceptual quality. The approach is based on the Expectation-Maximization (EM) framework, where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. EMD introduces a reparametrized sampling scheme and a noise cancellation technique to stabilize the distillation process. It also reveals an interesting connection with existing methods that minimize mode-seeking KL divergence. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
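To make the one-step interface concrete, here is a minimal sketch of what such a student generator looks like; the MLP architecture, dimensions, and names are illustrative placeholders, not the paper's actual model.

```python
import torch
import torch.nn as nn

class OneStepGenerator(nn.Module):
    """Toy student: a single forward pass maps noise z to a sample x,
    replacing the teacher's iterative denoising chain."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # x = g_theta(z): one network evaluation

z = torch.randn(16, 64)       # noise from the student's prior
x = OneStepGenerator()(z)     # one step, vs. many teacher sampling steps
```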
The paper proposes EMD, a diffusion distillation method that minimizes an approximation of the mode-covering (forward KL) divergence between a pre-trained diffusion teacher model and a latent-variable student model. The student enables efficient generation by mapping noise to data in a single step. To train the student toward maximum likelihood estimation (MLE) of the teacher's marginal distribution, the paper proposes a procedure analogous to the Expectation-Maximization (EM) framework, alternating between an Expectation-step (E-step) that estimates the learning gradients with Monte Carlo samples and a Maximization-step (M-step) that updates the student through gradient ascent.
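The alternation can be sketched as follows. This is a hedged illustration, assuming a Gaussian observation model p(x|z) = N(g(z), sigma^2 I) and a teacher that exposes a score function; the paper's reparametrized sampling scheme and noise-cancellation technique are omitted, and the latent z is held fixed during MCMC for simplicity.

```python
import torch

def em_distill_step(g, teacher_score, opt, batch=64, dim=64,
                    sigma=0.1, mcmc_steps=4, step_size=1e-3):
    """One EMD-style update: Monte Carlo E-step, gradient-ascent M-step."""
    # E-step: draw (z, x) pairs from the student, then refine x toward
    # the teacher with a few Langevin MCMC steps under the teacher's
    # score. (The paper also updates z and stabilizes the chain with
    # reparametrization and noise cancellation; both omitted here.)
    z = torch.randn(batch, dim)
    with torch.no_grad():
        x = g(z)
        for _ in range(mcmc_steps):
            x = x + step_size * teacher_score(x) \
                  + (2 * step_size) ** 0.5 * torch.randn_like(x)

    # M-step: ascend the Monte Carlo estimate of E[log p_theta(x | z)].
    # Under the assumed Gaussian model p(x|z) = N(g(z), sigma^2 I),
    # this reduces to an L2 regression of g(z) onto the refined x.
    opt.zero_grad()
    loss = ((g(z) - x) ** 2).sum(dim=-1).mean() / (2 * sigma ** 2)
    loss.backward()
    opt.step()
    return loss.item()
```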
The paper also discusses the connection between EMD and existing methods such as Variational Score Distillation (VSD) and Diff-Instruct, showing how the strength of the MCMC sampling scheme interpolates between mode-seeking and mode-covering divergences. Empirically, the paper demonstrates that a special case of EMD, which is equivalent to the Diff-Instruct baseline, can be readily scaled and improved to achieve strong performance, and that the general formulation of EMD, which leverages multi-step MCMC, achieves even more competitive results. On ImageNet-64 and ImageNet-128 conditional generation, EMD outperforms existing one-step generation approaches, with FID scores of 2.20 and 6.0 respectively. EMD also performs favorably on one-step text-to-image generation by distilling from Stable Diffusion models.
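In the toy setup above, the number of MCMC steps is the interpolation knob: fewer steps keep the E-step samples near the student's own outputs (the mode-seeking regime), while more steps pull them toward the teacher (the mode-covering regime). A hypothetical usage, reusing the sketches above with a stand-in teacher (the score of a standard Gaussian):

```python
import torch

g = OneStepGenerator()                    # student from the first sketch
opt = torch.optim.Adam(g.parameters(), lr=1e-4)

def teacher_score(x):                     # stand-in teacher: score of N(0, I)
    return -x

for k in (1, 4, 16):                      # illustrative step counts only
    # Note: in this simplified sketch, k = 0 would degenerate (the
    # regression target equals the student's own output); the paper's
    # full gradient recovers Diff-Instruct in that limit.
    loss = em_distill_step(g, teacher_score, opt, mcmc_steps=k)
    print(f"mcmc_steps={k}: loss={loss:.4f}")
```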