Taming Transformers for High-Resolution Image Synthesis

23 Jun 2021 | Patrick Esser*, Robin Rombach*, Björn Ommer
This paper presents a method for high-resolution image synthesis with transformers that combines the strengths of convolutional neural networks (CNNs) and transformer architectures. A convolutional VQGAN first learns a discrete codebook of context-rich visual parts; an autoregressive transformer then models the composition of these parts as a sequence of codebook indices. Because the transformer operates on short sequences of perceptually rich codes rather than on raw pixels, the model can synthesize high-resolution images efficiently while retaining the expressive flexibility of transformers. The method supports both unconditional and conditional synthesis, where additional information such as object classes or spatial layouts controls the generated image. It is evaluated on a wide range of tasks and datasets, including semantic image synthesis, structure-to-image translation, pose-guided synthesis, and class-conditional synthesis, and it outperforms previous approaches in image quality and efficiency, enabling the generation of high-resolution images with transformers. Strong results on the S-FLCKR dataset further demonstrate its effectiveness at generating images from semantic layouts.
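The two-stage design is easiest to see in code. Below is a minimal sketch in PyTorch, assuming illustrative hyperparameters (codebook size, embedding dimension, sequence length) rather than the paper's actual configuration; class names such as VectorQuantizer and CodeTransformer are hypothetical, and the VQGAN's adversarial and perceptual losses are omitted for brevity.

```python
# Minimal two-stage sketch, assuming PyTorch. All hyperparameters and class
# names here are illustrative, not the paper's actual configuration.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Stage 1 (core step): snap each spatial latent to its nearest codebook entry."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, C, H, W) from a CNN encoder
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)         # one vector per spatial location
        dist = torch.cdist(flat, self.codebook.weight)      # L2 distance to every code
        idx = dist.argmin(dim=1)                            # index of the nearest code
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                        # straight-through gradient estimator
        return z_q, idx.view(B, H * W)                      # quantized latents + flat index sequence

class CodeTransformer(nn.Module):
    """Stage 2: autoregressively model the sequence of codebook indices."""
    def __init__(self, num_codes=1024, dim=256, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):                                 # idx: (B, T) codebook indices
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        x = self.blocks(x, mask=causal)                     # causal mask enforces left-to-right prediction
        return self.head(x)                                 # logits over the next code at each position

# Toy usage: quantize a fake 16x16 latent grid, then score next-code predictions.
z = torch.randn(2, 256, 16, 16)
vq = VectorQuantizer()
z_q, codes = vq(z)                                          # codes: (2, 256) token sequence
logits = CodeTransformer()(codes[:, :-1])                   # predict code t from codes < t
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
```

In this framing, conditioning information such as a class label or a quantized semantic layout can simply be prepended to the index sequence, so the same autoregressive model covers both the unconditional and conditional tasks described above.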