23 Jun 2021 | Patrick Esser*, Robin Rombach*, Björn Ommer
This paper addresses the challenge of using transformers for high-resolution image synthesis, which is computationally infeasible at the pixel level because self-attention scales quadratically with sequence length. The authors propose a two-stage approach that combines the strengths of convolutional neural networks (CNNs) and transformers. In the first stage, a CNN-based VQGAN (a vector-quantized autoencoder trained with an adversarial and perceptual loss) learns a context-rich discrete vocabulary of image constituents. In the second stage, images are represented as sequences of indices into this vocabulary, and a transformer autoregressively models their global composition. The approach is applied to both unconditional and conditional synthesis tasks, demonstrating state-of-the-art results on megapixel images and outperforming previous convolutional approaches. The method is flexible and can be adapted to various image synthesis tasks, including semantic image synthesis, structure-to-image translation, pose-guided synthesis, stochastic super-resolution, and class-conditional image synthesis. Experiments show that the proposed approach retains the expressiveness of transformers while leveraging the efficiency of CNNs, achieving high-quality and globally consistent image generation.
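To make the two stages concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation (their official code is at github.com/CompVis/taming-transformers); the class names `SimpleVQ` and `CodeTransformer`, and all hyperparameters, are illustrative assumptions. The first block shows the core vector-quantization step (nearest-neighbour codebook lookup with a straight-through gradient), the second a causal transformer over the flattened grid of code indices.

```python
import torch
import torch.nn as nn

class SimpleVQ(nn.Module):
    """Stage 1 (sketch): snap CNN encoder features to a learned discrete codebook."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, H*W, dim) features from a convolutional encoder (not shown).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, H*W, num_codes)
        idx = dists.argmin(dim=-1)            # discrete code indices
        z_q = self.codebook(idx)              # quantized features
        # Straight-through estimator: copy gradients past the non-differentiable argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx

class CodeTransformer(nn.Module):
    """Stage 2 (sketch): causal transformer modeling p(s_i | s_<i) over code indices."""
    def __init__(self, num_codes=1024, dim=256, seq_len=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):
        # idx: (B, T) flattened grid of codebook indices produced by stage 1.
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Upper-triangular -inf mask so position i attends only to positions < i.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # logits for the next code
```

At sampling time, one would draw indices one position at a time from these logits (optionally prefixed with conditioning codes, e.g. from a segmentation map, for the conditional tasks above) and hand the completed index grid to the frozen VQGAN decoder to recover pixels. The efficiency argument is visible in the shapes: with 16x spatial downsampling, a 256x256 image becomes a 16x16 grid of 256 tokens rather than 65,536 pixels, which is what keeps the transformer's quadratic attention tractable.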