Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
22 Jun 2022 | Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
The paper introduces the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images from text descriptions. Parti treats text-to-image generation as a sequence-to-sequence problem, using a Transformer-based image tokenizer (ViT-VQGAN) to encode images into discrete tokens. The model is scaled up to 20 billion parameters, achieving a state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on the MS-COCO dataset. Parti demonstrates strong generalization to longer descriptions and complex compositions, as evaluated on the Localized Narratives dataset. The authors also introduce PartiPrompts (P2), a benchmark of over 1600 English prompts that measures model capabilities across various categories and difficulty levels. The paper discusses the effectiveness of Parti, its limitations, and future directions for improving text-to-image generation models.
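The sequence-to-sequence formulation described above can be sketched as a two-stage inference loop: a Transformer autoregressively predicts discrete image-token ids conditioned on the text tokens, and a ViT-VQGAN decoder then maps those ids back to pixels. The following is a minimal, runnable illustration of that loop only; `stub_next_token` and the codebook size are illustrative stand-ins, not the authors' actual model or API.

```python
# Hedged sketch of Parti-style autoregressive decoding of image tokens.
# All names here are hypothetical stand-ins: the real next-token predictor
# is a Transformer of up to 20B parameters, and the real detokenizer is
# the ViT-VQGAN decoder, neither of which is reproduced here.

CODEBOOK_SIZE = 8192  # assumed size of the discrete image-token vocabulary


def stub_next_token(text_tokens, image_tokens):
    """Stand-in for the Transformer decoder's next-token prediction.

    Uses a deterministic toy rule so the sketch runs; the real model
    produces a distribution over codebook entries and samples from it.
    """
    return (sum(text_tokens) + len(image_tokens)) % CODEBOOK_SIZE


def generate_image_tokens(text_tokens, num_tokens=16):
    """Autoregressively decode a fixed-length sequence of image-token ids,
    conditioned on the text tokens (the sequence-to-sequence view)."""
    image_tokens = []
    for _ in range(num_tokens):
        # Each step conditions on the text and on all tokens decoded so far.
        image_tokens.append(stub_next_token(text_tokens, image_tokens))
    return image_tokens


tokens = generate_image_tokens([5, 17, 3], num_tokens=16)
# In the real pipeline, these token ids would be fed to the ViT-VQGAN
# decoder to reconstruct the output image.
```

The point of the sketch is the structure, not the arithmetic: image generation reduces to ordinary left-to-right token prediction, which is what lets the model be scaled with standard Transformer training recipes.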