13 Apr 2022 | Aditya Ramesh*, Prafulla Dhariwal*, Alex Nichol*, Casey Chu*, Mark Chen
This paper introduces unCLIP, a text-conditional image generation model that combines CLIP embeddings with diffusion models. The model consists of two components: a prior that generates a CLIP image embedding from a text caption, and a decoder that generates an image conditioned on that embedding. The prior can be either autoregressive or diffusion-based; the diffusion prior is more computationally efficient and produces slightly higher-quality samples. The decoder is a diffusion model conditioned on the CLIP image embedding, which makes it possible to produce variations of an image that preserve its semantics and style while varying non-essential details. Because CLIP places text and images in a joint embedding space, the model also supports language-guided image manipulations in a zero-shot fashion.

Evaluated on MS-COCO, unCLIP reaches photorealism comparable to GLIDE while producing noticeably more diverse samples, and it is compared against DALL-E and GLIDE in terms of aesthetic quality, fidelity, and diversity. The paper also discusses limitations and risks, notably the model's difficulty in binding attributes to the correct objects and its trouble rendering fine details in complex scenes. The authors conclude that unCLIP is a promising approach for text-conditional image generation with potential uses across a range of applications.
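To make the two-stage design concrete, here is a minimal sketch of the sampling pipeline: caption → CLIP text embedding → prior → CLIP image embedding → decoder → image. The `TextEncoder`, `Prior`, and `Decoder` classes below are toy stand-ins (single layers) for the frozen CLIP text encoder, the autoregressive or diffusion prior, and the diffusion decoder described in the paper, so only the shapes and the data flow are meaningful, not the internals.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three learned components; 768-d embeddings follow
# CLIP ViT-L/14 conventions. The real components are large transformer /
# diffusion networks, not single layers.

class TextEncoder(nn.Module):
    """Stand-in for the frozen CLIP text encoder: caption tokens -> z_t."""
    def __init__(self, vocab=49408, dim=768):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)  # mean-pools token embeddings

    def forward(self, tokens):
        return self.embed(tokens)                 # (batch, dim)

class Prior(nn.Module):
    """Stand-in for the prior P(z_i | y): text embedding -> CLIP image embedding.
    In the paper this is an autoregressive or diffusion model; a linear map
    stands in for that iterative process here."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, z_t):
        return self.net(z_t)                      # (batch, dim)

class Decoder(nn.Module):
    """Stand-in for the decoder P(x | z_i): a diffusion model conditioned on
    the CLIP image embedding. Here, a projection to a 64x64 RGB image."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 64 * 64)

    def forward(self, z_i):
        return self.net(z_i).view(-1, 3, 64, 64)  # (batch, 3, 64, 64)

# Two-stage sampling: caption -> z_t -> z_i -> image.
text_encoder, prior, decoder = TextEncoder(), Prior(), Decoder()
tokens = torch.randint(0, 49408, (1, 16))  # toy tokenized caption
z_t = text_encoder(tokens)
z_i = prior(z_t)       # stage 1: prior maps text embedding to image embedding
image = decoder(z_i)   # stage 2: decoder renders an image from the embedding
print(image.shape)     # torch.Size([1, 3, 64, 64])
```

In the full model both stages are iterative diffusion samplers, and the 64x64 decoder output is further refined by diffusion upsamplers to 256x256 and 1024x1024; re-running the decoder with different noise while keeping z_i fixed is what yields the semantics-preserving image variations described above.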