13 Apr 2022 | Aditya Ramesh*, Prafulla Dhariwal*, Alex Nichol*, Casey Chu*, Mark Chen
The paper "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen from OpenAI proposes a two-stage model for text-conditional image generation. The first stage generates a CLIP image embedding from a text caption, and the second stage uses this embedding to condition a decoder that produces the final image. This approach leverages the robust representations learned by CLIP, which capture both semantics and style, to improve image diversity while maintaining photorealism and caption similarity. The decoders can produce variations of an image that preserve its semantics and style while varying non-essential details. The joint embedding space of CLIP also enables zero-shot language-guided image manipulations. The authors use diffusion models for the decoder and compare autoregressive and diffusion priors for the first stage, finding that diffusion priors are more computationally efficient and produce higher-quality samples. The paper includes experiments and human evaluations to demonstrate the effectiveness of the proposed method, showing that it achieves comparable quality to existing models like GLIDE but with greater diversity.The paper "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen from OpenAI proposes a two-stage model for text-conditional image generation. The first stage generates a CLIP image embedding from a text caption, and the second stage uses this embedding to condition a decoder that produces the final image. This approach leverages the robust representations learned by CLIP, which capture both semantics and style, to improve image diversity while maintaining photorealism and caption similarity. The decoders can produce variations of an image that preserve its semantics and style while varying non-essential details. The joint embedding space of CLIP also enables zero-shot language-guided image manipulations. The authors use diffusion models for the decoder and compare autoregressive and diffusion priors for the first stage, finding that diffusion priors are more computationally efficient and produce higher-quality samples. The paper includes experiments and human evaluations to demonstrate the effectiveness of the proposed method, showing that it achieves comparable quality to existing models like GLIDE but with greater diversity.