Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

5 Mar 2024 | Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
This paper presents a scaling analysis of rectified flow models for high-resolution image synthesis. The authors propose a novel timestep sampling method for rectified flow training, which improves over existing diffusion training formulations and retains the favorable properties of rectified flows in the few-step sampling regime. They also introduce a novel transformer-based architecture, MM-DiT, designed to handle the multi-modal nature of text-to-image tasks. The architecture uses separate weights for text and image tokens, enabling bidirectional information flow between the two modalities and improving text comprehension, typography, and human preference ratings. The authors conduct a large-scale study to demonstrate the superior performance of their approach compared to established diffusion formulations for high-resolution text-to-image synthesis. They show that their largest models outperform state-of-the-art models in various metrics and human evaluations, and they make their experimental data, code, and model weights publicly available. The paper concludes with a discussion of the broader impact of their work on the field of machine learning and image synthesis.
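As a rough illustration of the rectified-flow objective and a biased timestep sampler of the kind the paper studies, the PyTorch sketch below draws t from a logit-normal distribution, forms the straight-line interpolation between data and noise, and regresses a velocity target. Function names, the `loc`/`scale` defaults, and the `model` interface are assumptions for illustration, not the authors' code.

```python
import torch

def rectified_flow_loss(model, x0, text_emb, loc=0.0, scale=1.0):
    """One rectified-flow training step (sketch, assumed interface).

    `model(x_t, t, text_emb)` is assumed to predict a velocity field; the
    timestep is drawn from a logit-normal distribution, one of the biased
    samplers analyzed in the paper, with illustrative defaults.
    """
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(u) with u ~ N(loc, scale^2),
    # which concentrates training on intermediate noise levels.
    u = torch.randn(b, device=x0.device) * scale + loc
    t = torch.sigmoid(u).view(b, 1, 1, 1)

    # Rectified-flow forward process: straight-line interpolation between
    # the data sample x0 and Gaussian noise eps.
    eps = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * eps

    # Along the straight path, the conditional velocity target is eps - x0.
    v_target = eps - x0
    v_pred = model(x_t, t.flatten(), text_emb)
    return ((v_pred - v_target) ** 2).mean()
```

The MM-DiT design described above can likewise be sketched as a transformer block in which text and image tokens keep separate projection and MLP weights but share a single attention over the concatenated token sequence, so information flows in both directions. The block below is a simplified, hypothetical rendering that omits the paper's timestep modulation (AdaLN) and normalization details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """Minimal MM-DiT-style block (simplified sketch).

    Separate QKV/output/MLP weights per modality, joint attention over the
    concatenation of text and image tokens. `dim` must be divisible by
    `n_heads`.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt = txt.shape[1]
        h = self.n_heads

        # Modality-specific projections, then one joint attention over
        # the concatenated [text; image] sequence.
        qkv = torch.cat([self.qkv_txt(txt), self.qkv_img(img)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (x.view(b, n_txt + n_img, h, d // h).transpose(1, 2) for x in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, n_txt + n_img, d)
        a_txt, a_img = attn[:, :n_txt], attn[:, n_txt:]

        # Modality-specific output projections and MLPs with residuals.
        txt = txt + self.out_txt(a_txt)
        img = img + self.out_img(a_img)
        txt = txt + self.mlp_txt(txt)
        img = img + self.mlp_img(img)
        return img, txt
```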