6 Jun 2024 | Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom
This paper introduces a new method for accelerating diffusion models by distilling many-step diffusion models into few-step models through moment matching: the distilled model is trained to match the teacher's conditional expectations of the clean data given noisy data along the sampling trajectory. This extends previously proposed one-step distillation methods to the multi-step case, and offers a new perspective by interpreting those methods in terms of moment matching.
Using up to 8 sampling steps, the distilled models outperform not only their one-step versions but also their original many-step teachers, achieving new state-of-the-art results on ImageNet. The approach is also effective on a large text-to-image model, enabling fast generation of high-resolution images directly in image space, without autoencoders or upsamplers. Because it is based on moment matching, the method can distill diffusion models without an auxiliary model, and can be implemented using two independent minibatches per parameter update. Evaluations on ImageNet and the large text-to-image model show significant improvements in image quality and generation speed, demonstrating that moment-matching distillation produces high-quality images efficiently.
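The "two independent minibatches per parameter update" detail can be motivated with a toy example. This is not the paper's actual objective, just a hedged illustration of the underlying statistical trick: when a loss involves a *squared* gap between an expectation and a target, estimating it from a single minibatch is biased upward (the variance of the batch mean leaks in), whereas multiplying the gaps from two independent minibatches gives an unbiased estimate. All names below (`f`, `true_mean`, `target`) are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x  # hypothetical "model prediction" for this toy example

true_mean, target = 1.0, 0.0
# The quantity we actually want: the squared moment gap
# (E[f(x)] - target)^2 = (1.0 - 0.0)^2 = 1.0

def single_batch_estimate(n=4):
    # Biased: E[(batch_mean - target)^2] = gap^2 + Var(batch_mean)
    x = rng.normal(true_mean, 1.0, size=n)
    return (f(x).mean() - target) ** 2

def two_batch_estimate(n=4):
    # Unbiased: the two batch means are independent, so the
    # expectation of the product factorizes into gap * gap.
    x1 = rng.normal(true_mean, 1.0, size=n)
    x2 = rng.normal(true_mean, 1.0, size=n)
    return (f(x1).mean() - target) * (f(x2).mean() - target)

trials = 20000
single = np.mean([single_batch_estimate() for _ in range(trials)])
double = np.mean([two_batch_estimate() for _ in range(trials)])
# single overshoots the true gap of 1.0 by roughly Var(mean) = 1/4;
# double lands close to 1.0.
```

With batch size 4, the single-batch estimator is biased upward by about 0.25, while the two-batch estimator concentrates around the true value of 1.0; this is the kind of property that lets the distillation loss be optimized with two independent minibatches per update rather than an auxiliary model.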