Zero-Shot Text-to-Image Generation

26 Feb 2021 | Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
This paper presents a simple approach to text-to-image generation: a transformer that autoregressively models text and image tokens as a single stream of data. Training proceeds in two stages. First, a discrete variational autoencoder (dVAE) is trained to compress each 256×256 image into a 32×32 grid of image tokens drawn from an 8192-entry codebook. Second, a 12-billion-parameter autoregressive transformer is trained to model the joint distribution over up to 256 BPE-encoded text tokens and the 1024 image tokens. The transformer is trained on 250 million image-text pairs collected from the internet, using mixed-precision training and distributed optimization to handle the model's scale.

Evaluated zero-shot on MS-COCO, without using any of that dataset's training labels, the system generates realistic images that match their captions and is preferred over prior work trained on MS-COCO by human evaluators 90% of the time; it is also evaluated on the CUB dataset. Beyond caption-conditioned generation, the model generalizes to new categories and can perform rudimentary zero-shot image-to-image translation controlled by natural language.
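To make the two-stage setup concrete, the sketch below shows how image tokens from a dVAE encoder and BPE text tokens can be concatenated into one sequence and trained with a standard next-token objective. It is a minimal conceptual illustration, not the released DALL-E implementation: the module sizes, the tiny transformer, and the `encode_image_with_dvae` interface are assumptions for illustration; only the 8192-entry image codebook, the 256-token text length, and the 32×32 image grid come from the paper.

```python
# Conceptual sketch of modeling text + image tokens as a single stream.
# Sizes and module names are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 16384                    # BPE text vocabulary size (assumed)
IMAGE_VOCAB = 8192                    # dVAE codebook size from the paper
TEXT_LEN, IMAGE_LEN = 256, 32 * 32    # 256 text tokens + 32x32 image grid

class TinyAutoregressiveTransformer(nn.Module):
    """Autoregressively models the combined text + image token stream."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB      # shared vocab: image ids are offset by TEXT_VOCAB
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) of combined token ids; causal mask enforces left-to-right modeling
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))   # next-token logits

# Stage 1 (assumed interface): a trained dVAE encoder maps a 256x256 image
# to a 32x32 grid of discrete codes in [0, IMAGE_VOCAB).
def encode_image_with_dvae(dvae_encoder, images):
    logits = dvae_encoder(images)                      # (batch, IMAGE_VOCAB, 32, 32)
    return logits.argmax(dim=1).flatten(1)             # (batch, 1024) image token ids

# Stage 2: concatenate text tokens and offset image tokens into one stream
# and train with the usual next-token cross-entropy objective.
def training_step(model, text_tokens, image_tokens):
    stream = torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)
    logits = model(stream[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))
```

At sampling time the same model would be fed only the text tokens and asked to generate the 1024 image tokens one at a time, which the dVAE decoder then maps back to pixels.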