23 May 2022 | Chitwan Saharia*, William Chan*, Saurabh Saxena†, Lala Li†, Jay Whang†, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho†, David J Fleet†, Mohammad Norouzi*
Imagen is a text-to-image diffusion model that achieves unprecedented photorealism and deep language understanding. It pairs large transformer language models for text encoding with diffusion models for image generation. A key finding is that scaling the size of the text encoder improves both image fidelity and image-text alignment far more than scaling the size of the image diffusion model. Imagen achieves a state-of-the-art FID score of 7.27 on the COCO dataset without ever training on COCO, and human raters judge its samples to be on par with COCO data in image-text alignment.
DrawBench, a new benchmark for text-to-image models, shows that Imagen outperforms VQ-GAN+CLIP, Latent Diffusion Models, GLIDE, and DALL-E 2 in both sample quality and image-text alignment. Imagen introduces dynamic thresholding, which prevents pixel saturation at high classifier-free guidance weights and yields more photorealistic samples, and an Efficient U-Net architecture for faster, more memory-efficient training. It also uses noise conditioning augmentation to improve sample quality and robustness.
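The dynamic thresholding trick can be sketched in a few lines. This is a minimal sketch, not the paper's exact implementation: the function name is hypothetical, and the 99.5 default percentile is one of the values the paper reports experimenting with.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Dynamic thresholding (sketch of the Imagen technique).

    s is set to the given percentile of the absolute predicted pixel
    values. When s > 1, pixels are clipped to [-s, s] and rescaled by s,
    pushing the prediction back into [-1, 1] without the oversaturation
    that static clipping to [-1, 1] causes at high guidance weights.
    """
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)  # never rescale values that are already in range
    return np.clip(x0_pred, -s, s) / s
```

At each sampling step this is applied to the predicted clean image, so the threshold adapts to however far the guided prediction has drifted out of range.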
Imagen's text encoder, based on T5-XXL, outperforms CLIP in image-text alignment on DrawBench. The model uses a cascaded diffusion approach: a base 64x64 model followed by two super-resolution models, one from 64x64 to 256x256 and one from 256x256 to 1024x1024. It achieves high fidelity and alignment, with human evaluations showing strong performance. However, it has limitations in generating realistic images of people and may encode social biases.
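The noise conditioning augmentation used on the super-resolution stages of the cascade can be sketched as below. The function name and the cosine signal/noise schedule are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def noise_condition_augment(low_res, aug_level, rng):
    """Noise conditioning augmentation (sketch; the cosine schedule is
    an assumption, not necessarily the paper's parameterization).

    The low-resolution conditioning image is corrupted with Gaussian
    noise at strength aug_level in [0, 1], and the super-resolution
    model is additionally conditioned on aug_level itself. Training
    this way makes each stage robust to artifacts produced by the
    previous stage of the cascade.
    """
    alpha = np.cos(aug_level * np.pi / 2)  # signal scale
    sigma = np.sin(aug_level * np.pi / 2)  # noise scale
    return alpha * low_res + sigma * rng.standard_normal(low_res.shape)
```

At training time aug_level is sampled randomly; at sampling time a fixed level is chosen, and the paper reports sweeping it to trade off sample quality against fidelity to the low-resolution input.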
Imagen's training data includes image and English alt-text pairs, but recent audits revealed inappropriate content, leading to concerns about its public use. The model inherits social biases from large language models and may reproduce stereotypes. Despite these limitations, Imagen represents a significant advancement in text-to-image generation, with potential applications in creative fields but also risks of misuse. The authors emphasize the need for responsible development and evaluation of such models.