Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding


23 May 2022 | Chitwan Saharia*, William Chan*, Saurabh Saxena†, Lala Li†, Jay Whang†, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho†, David J. Fleet†, Mohammad Norouzi*
Imagen is a text-to-image diffusion model that combines large transformer language models with high-fidelity diffusion models to achieve an unprecedented degree of photorealism and a deep level of language understanding. The key discovery is that generic large language models (e.g., T5) pretrained on text-only corpora are surprisingly effective at encoding text for image synthesis: scaling the size of the frozen text encoder improves sample fidelity and image-text alignment far more than scaling the image diffusion model itself.

Imagen achieves a state-of-the-art zero-shot FID of 7.27 on COCO without ever training on COCO, and human raters judge its samples to be on par with the reference images in image-text alignment. To probe text-to-image models more deeply, the authors introduce DrawBench, a benchmark of 200 prompts across 11 categories, each testing a different model capability. In human evaluations on DrawBench, Imagen outperforms recent methods including DALL-E 2, GLIDE, and VQ-GAN+CLIP.

Key contributions:
- Large frozen language models pretrained only on text are effective text encoders for image generation (a minimal encoding sketch follows this list).
- Dynamic thresholding, a sampling technique that allows large guidance weights without the usual oversaturated, unnatural images (sketched in code below).
- Efficient U-Net, a simpler, faster, and more memory-efficient architecture for the diffusion models.
- DrawBench, a challenging benchmark for comprehensive evaluation of text-to-image models.
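The central finding, that a frozen text-only encoder suffices, is easy to illustrate. The sketch below uses the Hugging Face transformers library to extract per-token embeddings from a frozen T5 encoder, the kind of conditioning signal Imagen feeds to its diffusion models. The checkpoint name ("t5-large"), the prompt, and the maximum sequence length are illustrative stand-ins; Imagen itself uses a much larger T5-XXL encoder.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load a frozen T5 encoder. Imagen uses T5-XXL; "t5-large" here is a
# smaller, illustrative stand-in.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()

prompt = "A photo of a raccoon wearing an astronaut helmet."
tokens = tokenizer(
    prompt,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=128,  # illustrative; the context length is a design choice
)

# The encoder stays frozen: no gradients ever flow into the language model.
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, 128, d_model)
```

In Imagen, a sequence of embeddings like `text_embeddings` conditions the diffusion U-Nets via cross-attention; the paper finds that simply swapping in a larger frozen encoder improves both fidelity and alignment.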
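Dynamic thresholding is also simple enough to state in a few lines. Below is a minimal PyTorch sketch of the idea as the paper describes it: at each sampling step, clip the predicted clean image to [-s, s], where s is a high per-example percentile of the absolute pixel values, then rescale by s. The function name and the 99.5th-percentile default are illustrative choices, not the paper's exact configuration.

```python
import torch

def dynamic_threshold(x0_pred: torch.Tensor, p: float = 0.995) -> torch.Tensor:
    """Dynamic thresholding sketch (after the Imagen paper).

    x0_pred: predicted clean image at a sampling step, shape (B, C, H, W),
    nominally in [-1, 1] but pushed outside that range by large guidance
    weights.
    """
    b = x0_pred.shape[0]
    # Per-example p-th percentile of absolute pixel values.
    s = torch.quantile(x0_pred.abs().reshape(b, -1), p, dim=1)
    # Never shrink the threshold below the standard [-1, 1] range.
    s = s.clamp(min=1.0).view(b, 1, 1, 1)
    # Clip to [-s, s], then rescale so pixels land back in [-1, 1].
    return x0_pred.clamp(-s, s) / s
```

Compared with static clipping to [-1, 1], this keeps large classifier-free guidance weights usable without the saturated, washed-out images they otherwise produce.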