SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers


16 Jan 2024 | Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie
SiT is a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework connects two distributions more flexibly than standard diffusion models, enabling a modular study of the design choices behind generative models based on dynamical transport: learning in discrete vs. continuous time, deciding which objective the model learns, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By introducing these ingredients carefully, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06.

The paper studies each design choice in isolation and reports the following findings (sketches of the key pieces follow this summary):

- Moving from discrete to continuous time, and changing the model prediction, the interpolant, and the sampler each give a consistent performance improvement over DiT.
- The reverse-time SDE of the interpolant can be instantiated using just a velocity model, since the score is recoverable from the velocity; this is used to push performance beyond previous results.
- Both the GVP and Linear interpolants obtain significantly improved performance over the standard diffusion path.
- Sampling with an SDE rather than the deterministic ODE leads to better performance, and the optimal diffusion coefficient depends on both the model prediction and the interpolant.
- Classifier-free guidance leads to improved performance for score-based models.

SiT benefits from all of these training and sampling choices together and surpasses DiT in every training setting, not only with respect to model size but also with respect to sampling choices. A review of related work on transformers and diffusion models positions the interpolant framework as a promising direction for future research. The paper concludes that SiT is a simple and powerful framework for image generation, and that careful design decisions lead to significant performance improvements.
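To make the interpolant choice concrete, here is a minimal sketch of the two interpolants the paper compares, written as x_t = alpha_t * x0 + sigma_t * eps with data at t = 0 and Gaussian noise at t = 1. The function name and signature are illustrative, not SiT's actual API.

```python
import math
import torch

def interpolant_coeffs(t, kind="linear"):
    """Coefficients (alpha_t, sigma_t) of x_t = alpha_t*x0 + sigma_t*eps,
    plus their time derivatives, for the two interpolants in the paper.
    `t` is a tensor of times in [0, 1]; names here are illustrative."""
    if kind == "linear":
        # Linear interpolant: alpha_t = 1 - t, sigma_t = t
        return 1.0 - t, t, -torch.ones_like(t), torch.ones_like(t)
    if kind == "gvp":
        # GVP (generalized variance-preserving): alpha_t = cos(pi*t/2),
        # sigma_t = sin(pi*t/2), so alpha_t^2 + sigma_t^2 = 1 for all t.
        a = 0.5 * math.pi
        return (torch.cos(a * t), torch.sin(a * t),
                -a * torch.sin(a * t), a * torch.cos(a * t))
    raise ValueError(f"unknown interpolant: {kind}")
```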
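The continuous-time velocity-matching objective can then be written against the time derivative of the interpolant, with t drawn uniformly rather than from a discrete grid. This sketch reuses `interpolant_coeffs` from above; the unweighted MSE and the `model(xt, t)` call signature are assumptions, not SiT's exact training code.

```python
def velocity_loss(model, x0, kind="linear"):
    """Continuous-time velocity matching: the regression target is
    d(x_t)/dt = dalpha_t*x0 + dsigma_t*eps. `x0` is a batch of images."""
    t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)  # t ~ U(0, 1)
    eps = torch.randn_like(x0)
    alpha, sigma, dalpha, dsigma = interpolant_coeffs(t, kind)
    xt = alpha * x0 + sigma * eps            # noisy interpolant sample
    target = dalpha * x0 + dsigma * eps      # exact time derivative of x_t
    return ((model(xt, t) - target) ** 2).mean()
```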
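Because the score can be recovered from the velocity via the interpolant coefficients, a reverse-time SDE sampler needs only the velocity model. Below is a hedged Euler-Maruyama sketch (reusing the imports and `interpolant_coeffs` above); the constant diffusion coefficient `w`, the step count, and the drift convention are illustrative choices, not the paper's exact sampler.

```python
@torch.no_grad()
def sde_sample(model, shape, w=1.0, steps=250, kind="linear", device="cpu"):
    """Integrate the reverse-time SDE from t = 1 (noise) to t = 0 (data),
    driven only by a velocity model. All names/defaults are assumptions."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1, 1), 1.0 - i * dt, device=device)
        alpha, sigma, dalpha, dsigma = interpolant_coeffs(t, kind)
        v = model(x, t)
        # Score from velocity: eliminate E[x0|x_t] from the pair
        #   v = dalpha*E[x0|x] + dsigma*E[eps|x],
        #   x = alpha*E[x0|x] + sigma*E[eps|x],
        # then use score = -E[eps|x] / sigma.
        eps_hat = (alpha * v - dalpha * x) / (alpha * dsigma - dalpha * sigma)
        score = -eps_hat / sigma
        drift = v - 0.5 * w * score          # reverse-time drift
        noise = torch.randn_like(x) if i < steps - 1 else 0.0
        x = x - dt * drift + math.sqrt(w * dt) * noise
    return x
```

Setting `w = 0` recovers the deterministic probability-flow ODE, which is what makes the stochastic-vs.-deterministic comparison, and the tuning of the diffusion coefficient, a pure sampling-time choice independent of training.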
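Classifier-free guidance combines conditional and unconditional predictions at sampling time; the sketch below applies the standard guidance formula to the model's prediction (shown here on the velocity field), with `y_null` standing for a learned null-class embedding. The conditioning interface and the default scale are assumptions, not SiT's exact implementation.

```python
def guided_velocity(model, x, t, y, y_null, scale=1.5):
    """Classifier-free guidance: push the conditional prediction away
    from the unconditional one by a tunable scale (> 1 strengthens it)."""
    v_cond = model(x, t, y)         # class-conditional prediction
    v_uncond = model(x, t, y_null)  # unconditional prediction
    return v_uncond + scale * (v_cond - v_uncond)
```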