Elucidating the Design Space of Diffusion-Based Generative Models

11 Oct 2022 | Tero Karras, Miika Aittala, Timo Aila, Samuli Laine
The paper clarifies the design space of diffusion-based generative models by separating out its concrete design choices, leading to significant improvements in sampling efficiency and image quality. The authors propose several modifications to the sampling and training processes, including a second-order Heun (Runge-Kutta) method for deterministic sampling and improved preconditioning of score networks. These changes yield state-of-the-art Fréchet Inception Distance (FID) scores of 1.79 for class-conditional CIFAR-10 and 1.97 for unconditional CIFAR-10, while requiring only 35 network evaluations per image.
The paper also demonstrates that these design changes significantly improve pre-trained score networks: on ImageNet-64, the FID of a previously trained model improves from 2.07 to a near-SOTA 1.55, and retraining with the proposed recipe reaches a new SOTA of 1.36. The contributions are structured around a common framework for diffusion models, allowing modular exploration of individual components and their interactions. The authors conclude that this approach enables easier innovation and targeted exploration of the design space, with potential applications in domains such as audio, video, and language translation.
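To make the sampling improvement concrete, the sketch below shows the core idea of a second-order Heun integrator for the probability-flow ODE, dx/dσ = (x − D(x; σ))/σ, where D is a denoiser. This is an illustrative toy, not the paper's implementation: the `toy_denoiser` stands in for a trained network (for data distributed as N(0, 1), the optimal denoiser happens to be x/(1 + σ²)), and the noise schedule is a simple geometric spacing rather than the paper's tuned schedule.

```python
import numpy as np

def toy_denoiser(x, sigma):
    # Illustrative stand-in for a learned denoiser D(x; sigma).
    # For data ~ N(0, 1), the optimal denoiser is x / (1 + sigma^2).
    return x / (1.0 + sigma**2)

def heun_sampler(x, sigmas, denoise=toy_denoiser):
    # Integrate dx/dsigma = (x - D(x; sigma)) / sigma from high to low noise
    # using Heun's second-order method (Euler predictor + trapezoidal corrector).
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, s)) / s            # ODE derivative at current noise level
        x_euler = x + (s_next - s) * d         # Euler predictor step
        if s_next > 0:
            # Second-order correction: average slopes at both endpoints.
            d_next = (x_euler - denoise(x_euler, s_next)) / s_next
            x = x + (s_next - s) * 0.5 * (d + d_next)
        else:
            x = x_euler                        # plain Euler on the final step to sigma = 0
    return x

rng = np.random.default_rng(0)
sigmas = np.concatenate([np.geomspace(80.0, 0.002, 18), [0.0]])
x0 = rng.standard_normal(1000) * sigmas[0]     # start from pure noise at sigma_max
samples = heun_sampler(x0, sigmas)
```

With this toy denoiser the samples converge toward the assumed N(0, 1) data distribution; the point is that the trapezoidal correction lets far fewer steps reach a given accuracy than first-order Euler, which is the source of the 35-evaluation budget quoted above.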