Zero-Shot Text-to-Image Generation

26 Feb 2021 | Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
This paper presents a simple approach to text-to-image generation: a transformer that autoregressively models text and image tokens as a single stream of data. Training proceeds in two stages. First, a discrete variational autoencoder (dVAE) is trained to compress each 256×256 image into a 32×32 grid of image tokens drawn from an 8192-entry codebook. Second, a 12-billion-parameter autoregressive transformer is trained to model the joint distribution over up to 256 BPE-encoded text tokens and the 1024 image tokens. The transformer is trained on 250 million image-text pairs collected from the internet, using mixed-precision training and distributed optimization to handle the model's scale.

Evaluated zero-shot on MS-COCO, without using any of that dataset's training labels, the system generates realistic images that match their captions and is preferred over prior work trained on MS-COCO by human evaluators 90% of the time; it is also evaluated on the CUB dataset. Beyond caption-conditioned generation, the model generalizes to new categories and can perform rudimentary zero-shot image-to-image translation controlled by natural language.
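To make the two-stage setup concrete, the sketch below shows how image tokens from a dVAE encoder and BPE text tokens can be concatenated into one sequence and trained with a standard next-token objective. It is a minimal conceptual illustration, not the released DALL-E implementation: the module sizes, the tiny transformer, and the `encode_image_with_dvae` interface are assumptions for illustration; only the 8192-entry image codebook, the 256-token text length, and the 32×32 image grid come from the paper.

```python
# Conceptual sketch of modeling text + image tokens as a single stream.
# Sizes and module names are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 16384                    # BPE text vocabulary size (assumed)
IMAGE_VOCAB = 8192                    # dVAE codebook size from the paper
TEXT_LEN, IMAGE_LEN = 256, 32 * 32    # 256 text tokens + 32x32 image grid

class TinyAutoregressiveTransformer(nn.Module):
    """Autoregressively models the combined text + image token stream."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB      # shared vocab: image ids are offset by TEXT_VOCAB
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) of combined token ids; causal mask enforces left-to-right modeling
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))   # next-token logits

# Stage 1 (assumed interface): a trained dVAE encoder maps a 256x256 image
# to a 32x32 grid of discrete codes in [0, IMAGE_VOCAB).
def encode_image_with_dvae(dvae_encoder, images):
    logits = dvae_encoder(images)                      # (batch, IMAGE_VOCAB, 32, 32)
    return logits.argmax(dim=1).flatten(1)             # (batch, 1024) image token ids

# Stage 2: concatenate text tokens and offset image tokens into one stream
# and train with the usual next-token cross-entropy objective.
def training_step(model, text_tokens, image_tokens):
    stream = torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)
    logits = model(stream[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))
```

At sampling time the same model would be fed only the text tokens and asked to generate the 1024 image tokens one at a time, which the dVAE decoder then maps back to pixels.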