Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

5 Mar 2024 | Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
This paper presents a scaling analysis of rectified flow models for high-resolution image synthesis. The authors propose a novel timestep sampling method for rectified flow training, which improves over existing diffusion training formulations and retains the favorable properties of rectified flows in the few-step sampling regime. They also introduce a novel transformer-based architecture, MM-DiT, designed to handle the multi-modal nature of text-to-image tasks. The architecture uses separate weights for text and image tokens, enabling bidirectional information flow between the two modalities and improving text comprehension, typography, and human preference ratings. The authors conduct a large-scale study to demonstrate the superior performance of their approach compared to established diffusion formulations for high-resolution text-to-image synthesis. They show that their largest models outperform state-of-the-art models in various metrics and human evaluations, and they make their experimental data, code, and model weights publicly available. The paper concludes with a discussion of the broader impact of their work on the field of machine learning and image synthesis.
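As a rough illustration of the rectified-flow objective and a biased timestep sampler of the kind the paper studies, the PyTorch sketch below draws t from a logit-normal distribution, forms the straight-line interpolation between data and noise, and regresses a velocity target. Function names, the `loc`/`scale` defaults, and the `model` interface are assumptions for illustration, not the authors' code.

```python
import torch

def rectified_flow_loss(model, x0, text_emb, loc=0.0, scale=1.0):
    """One rectified-flow training step (sketch, assumed interface).

    `model(x_t, t, text_emb)` is assumed to predict a velocity field; the
    timestep is drawn from a logit-normal distribution, one of the biased
    samplers analyzed in the paper, with illustrative defaults.
    """
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(u) with u ~ N(loc, scale^2),
    # which concentrates training on intermediate noise levels.
    u = torch.randn(b, device=x0.device) * scale + loc
    t = torch.sigmoid(u).view(b, 1, 1, 1)

    # Rectified-flow forward process: straight-line interpolation between
    # the data sample x0 and Gaussian noise eps.
    eps = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * eps

    # Along the straight path, the conditional velocity target is eps - x0.
    v_target = eps - x0
    v_pred = model(x_t, t.flatten(), text_emb)
    return ((v_pred - v_target) ** 2).mean()
```

The MM-DiT design described above can likewise be sketched as a transformer block in which text and image tokens keep separate projection and MLP weights but share a single attention over the concatenated token sequence, so information flows in both directions. The block below is a simplified, hypothetical rendering that omits the paper's timestep modulation (AdaLN) and normalization details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """Minimal MM-DiT-style block (simplified sketch).

    Separate QKV/output/MLP weights per modality, joint attention over the
    concatenation of text and image tokens. `dim` must be divisible by
    `n_heads`.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt = txt.shape[1]
        h = self.n_heads

        # Modality-specific projections, then one joint attention over
        # the concatenated [text; image] sequence.
        qkv = torch.cat([self.qkv_txt(txt), self.qkv_img(img)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (x.view(b, n_txt + n_img, h, d // h).transpose(1, 2) for x in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, n_txt + n_img, d)
        a_txt, a_img = attn[:, :n_txt], attn[:, n_txt:]

        # Modality-specific output projections and MLPs with residuals.
        txt = txt + self.out_txt(a_txt)
        img = img + self.out_img(a_img)
        txt = txt + self.mlp_txt(txt)
        img = img + self.mlp_img(img)
        return img, txt
```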