Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

22 Jun 2022 | Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
The Pathways Autoregressive Text-to-Image (Parti) model generates high-fidelity, photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence problem, with sequences of discrete image tokens, rather than text tokens, as the target output. A Transformer-based image tokenizer, ViT-VQGAN, encodes each image as a sequence of discrete tokens and decodes predicted token sequences back into images; it produces higher-fidelity outputs and better codebook utilization than earlier tokenizers such as dVAE and VQ-VAE.

The encoder-decoder Transformer is trained on a combination of image-text datasets, including LAION-400M, FIT400M, and JFT-4B. Scaling the model to 20B parameters consistently improves results: the 20B model achieves a zero-shot FID of 7.23 and a finetuned FID of 3.22 on MS-COCO, and it also performs well on Localized Narratives, whose descriptions are substantially longer than MS-COCO captions. On the PartiPrompts (P2) benchmark, which comprises over 1600 English prompts across 12 categories and 11 challenge aspects, Parti outperforms other models on both image realism and image-text alignment. Training uses data parallelism and in-layer model parallelism, with the 20B model trained using 16-stage GSPMD pipelines. In human evaluations against XMC-GAN, Parti is preferred 91.7% of the time for image realism and 90.5% for image-text match.
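To make the tokenizer's role concrete, the core operation of a VQ-style image tokenizer is a nearest-neighbor lookup against a learned codebook. The sketch below is a toy illustration only: the codebook entries are random rather than learned, and the embedding dimension is an assumption, though the codebook size (8192) and the 1024-token grid for a 256×256 image match the paper's description of ViT-VQGAN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for ViT-VQGAN's learned codebook: 8192 entries.
# (Real codes are learned jointly with the encoder/decoder; these are random.)
codebook = rng.normal(size=(8192, 32))

def quantize(embeddings, codebook):
    """Return the index of the nearest codebook entry for each embedding."""
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2 (memory-friendly).
    d = ((embeddings ** 2).sum(1)[:, None]
         - 2.0 * embeddings @ codebook.T
         + (codebook ** 2).sum(1)[None, :])
    return d.argmin(axis=1)

# A 256x256 image encodes to a 32x32 grid of patch embeddings -> 1024 tokens.
patches = rng.normal(size=(1024, 32))
tokens = quantize(patches, codebook)
print(tokens.shape)  # (1024,)
```

The resulting integer sequence is what the Transformer decoder is trained to predict; the ViT-VQGAN decoder then maps a predicted token sequence back to pixels.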
Parti scales to 20B parameters when trained on large datasets, achieving high-quality image generation and content-rich synthesis, particularly for complex compositions and prompts requiring world knowledge. It is also effective for open-domain text-to-image generation, handling a wide range of prompts and challenge aspects. Evaluated on MS-COCO, Localized Narratives, and PartiPrompts, it shows strong performance in both image quality and image-text alignment.
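The sequence-to-sequence formulation above amounts to decoding image tokens one position at a time, conditioned on the text. The following sketch shows the shape of that loop; `toy_logits` is a hypothetical stand-in for the real encoder-decoder Transformer, and greedy decoding is used here for simplicity (the paper samples multiple candidates and reranks them).

```python
import numpy as np

VOCAB = 8192   # image-token vocabulary (ViT-VQGAN codebook size)
SEQ_LEN = 16   # toy length; Parti decodes 1024 tokens for a 256x256 image

def toy_logits(text_ids, image_ids):
    """Hypothetical stand-in for the Transformer: next-token logits.
    A real model attends over the encoded text and all previous image tokens."""
    seed = hash((tuple(text_ids), tuple(image_ids))) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def generate(text_ids, seq_len=SEQ_LEN):
    """Autoregressive decoding of image tokens, one position at a time."""
    image_ids = []
    for _ in range(seq_len):
        logits = toy_logits(text_ids, image_ids)
        image_ids.append(int(logits.argmax()))  # greedy; Parti samples + reranks
    return image_ids

image_tokens = generate(text_ids=[5, 17, 42])
print(len(image_tokens))  # 16
```

The finished token sequence is then handed to the ViT-VQGAN decoder, which renders it as an image.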
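The FID scores cited above (7.23 zero-shot, 3.22 finetuned on MS-COCO) are Fréchet Inception Distances: the Fréchet distance between Gaussian fits to feature statistics of real and generated images, where lower is better. Below is a minimal numpy sketch of the distance itself; the InceptionV3 feature extraction that produces the means and covariances is omitted, and the eigendecomposition-based matrix square root is a simplification of what standard implementations do.

```python
import numpy as np

def _sqrtm(a):
    """Matrix square root via eigendecomposition (fine for diagonalizable a)."""
    w, v = np.linalg.eig(a)
    return (v * np.sqrt(w.astype(complex))) @ np.linalg.inv(v)

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians given mean and covariance.
    In practice the statistics come from InceptionV3 features of real vs.
    generated images."""
    diff = mu1 - mu2
    covmean = _sqrtm(sigma1 @ sigma2).real  # drop numerical imaginary noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

mu, sigma = np.zeros(4), np.eye(4)
print(fid(mu, sigma, mu, sigma))  # 0.0 for identical statistics
```

Shifting one mean while keeping covariances equal increases the distance by the squared Euclidean shift, which is why FID is sensitive to both fidelity and diversity of the generated distribution.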