Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

5 Mar 2024 | Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
This paper studies the scaling of rectified flow transformers for high-resolution image synthesis. The authors improve rectified flow training by biasing the noise-level (timestep) sampling towards perceptually relevant intermediate scales, and show that with this change rectified flows outperform established diffusion formulations for high-resolution text-to-image synthesis. They also introduce a new transformer-based architecture for text-to-image generation that enables bidirectional information flow between image and text tokens, improving text comprehension, typography, and human preference ratings. The architecture follows predictable scaling trends, with lower validation loss correlating with better text-to-image synthesis. The authors' largest models outperform state-of-the-art models, and they make their experimental data, code, and model weights publicly available.
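To make the biased noise sampling concrete, the sketch below shows one rectified-flow training step with a logit-normal timestep sampler, which concentrates training on intermediate noise levels rather than the endpoints. This is a minimal sketch, not the paper's released code: the model interface `velocity_model(z_t, t, text_emb)` and the default location/scale parameters `m` and `s` are illustrative assumptions.

```python
# Minimal sketch: rectified-flow training with logit-normal timestep sampling.
import torch
import torch.nn.functional as F

def sample_logit_normal_t(batch_size, m=0.0, s=1.0, device="cpu"):
    # t = sigmoid(u) with u ~ N(m, s^2): biases t toward intermediate
    # (perceptually relevant) noise levels.
    u = torch.randn(batch_size, device=device) * s + m
    return torch.sigmoid(u)

def rectified_flow_loss(velocity_model, x, text_emb):
    # Rectified flow interpolates linearly between data x and noise eps:
    #   z_t = (1 - t) * x + t * eps,
    # and the network regresses the constant velocity (eps - x).
    b = x.shape[0]
    t = sample_logit_normal_t(b, device=x.device)
    t_ = t.view(b, *([1] * (x.ndim - 1)))   # broadcast t over C, H, W
    eps = torch.randn_like(x)
    z_t = (1.0 - t_) * x + t_ * eps
    v_target = eps - x
    v_pred = velocity_model(z_t, t, text_emb)  # assumed interface
    return F.mse_loss(v_pred, v_target)
```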
The paper also reviews simulation-free training of flows, which defines a mapping between samples from a noise distribution and samples from the data distribution via an ordinary differential equation. The authors compare several variants of this formalism, including Rectified Flow, EDM, Cosine, and LDM-Linear, and find that rectified flow models with tailored SNR samplers perform best, especially when sampling is biased towards intermediate timesteps. Their text-to-image architecture uses separate weights for the two modalities while enabling bidirectional information flow between image and text tokens, and outperforms existing backbones such as UViT and DiT. Experiments on image and video domains show that increasing model size and training steps smoothly decreases validation loss, and that validation loss correlates strongly with comprehensive evaluation metrics and human preference. The largest model outperforms current open and proprietary models in human preference evaluations, and the scaling trends show no signs of saturation, suggesting further improvements are possible.
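To illustrate the bidirectional information flow, the sketch below implements the core idea of a two-stream attention block: image and text tokens get separate projection weights, but attention runs over the concatenated sequence so each modality attends to the other. This is a conceptual sketch, not the released implementation; the class name, layer layout, and hyperparameters are assumptions.

```python
# Minimal sketch: joint attention with separate per-modality weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projection weights for the image and text streams.
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def _split_heads(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, img_tokens, txt_tokens):
        n_img = img_tokens.shape[1]
        q_i, k_i, v_i = self.qkv_img(img_tokens).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt_tokens).chunk(3, dim=-1)
        # Attention over the concatenated sequence: every image token can
        # attend to every text token and vice versa (bidirectional flow).
        q = self._split_heads(torch.cat([q_i, q_t], dim=1))
        k = self._split_heads(torch.cat([k_i, k_t], dim=1))
        v = self._split_heads(torch.cat([v_i, v_t], dim=1))
        out = F.scaled_dot_product_attention(q, k, v)
        b, _, n, _ = out.shape
        out = out.transpose(1, 2).reshape(b, n, -1)
        img_out, txt_out = out[:, :n_img], out[:, n_img:]
        return self.proj_img(img_out), self.proj_txt(txt_out)
```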