StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery


31 Mar 2021 | Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
StyleCLIP is a novel approach that leverages the power of Contrastive Language-Image Pre-training (CLIP) models to enable text-driven manipulation of images generated by StyleGAN. The method addresses the limitations of existing techniques, which often require manual examination or large annotated datasets for semantic control. StyleCLIP introduces three main techniques:

1. **Latent Optimization**: optimizes a latent code in StyleGAN's $\mathcal{W}+$ space using a CLIP-based loss, allowing for versatile but time-consuming manipulations (see the sketch after this list).
2. **Latent Mapper**: a mapping network is trained to infer a manipulation step in $\mathcal{W}+$ space for a given input image and text prompt, providing faster and more stable text-based manipulation.
3. **Global Directions**: maps a text prompt into a global direction in StyleGAN's style space ($\mathcal{S}$), enabling fine-grained and disentangled manipulations without the need for additional supervision.

The paper demonstrates the effectiveness of these methods through extensive experiments on various datasets, including human faces, animals, cars, and churches. The results show that StyleCLIP can achieve a wide range of semantic manipulations, from abstract to specific, and from extensive to fine-grained, with high quality and control over manipulation strength and disentanglement.
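The first technique can be illustrated with a minimal sketch: a latent code in $\mathcal{W}+$ is optimized so that the CLIP embedding of the generated image moves toward the embedding of the text prompt, while an L2 term keeps the code close to the source. This assumes a pretrained StyleGAN2 generator `G` and an inverted source code `w_init`; the loss weights are illustrative, and the identity-preservation term and CLIP image normalization used in the paper are omitted here for brevity.

```python
# Sketch of CLIP-guided latent optimization in W+ space.
# Assumptions (not the authors' code): `G` is a pretrained StyleGAN2 generator
# mapping a W+ code of shape (1, 18, 512) to an image; `w_init` is the inverted
# W+ code of the source image; the OpenAI `clip` package is installed.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
text_tokens = clip.tokenize(["a person with blue hair"]).to(device)

w = w_init.clone().detach().requires_grad_(True)  # optimize the W+ code directly
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(200):
    img = G(w)  # synthesize an image from the current W+ code
    # Resize to CLIP's input resolution (full CLIP preprocessing omitted here).
    img_224 = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
    image_features = clip_model.encode_image(img_224)
    text_features = clip_model.encode_text(text_tokens)
    # CLIP loss: 1 - cosine similarity between image and text embeddings.
    clip_loss = 1 - F.cosine_similarity(image_features, text_features).mean()
    # L2 regularizer keeps the edited code close to the source code.
    l2_loss = ((w - w_init) ** 2).mean()
    loss = clip_loss + 0.008 * l2_loss  # weight chosen for illustration only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because each edit requires hundreds of optimization steps, this approach is flexible but slow, which is what motivates the trained latent mapper and the precomputed global directions described above.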