Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model

2024-03-12 | Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, and Haibo Zhao
Stable-Makeup is a diffusion-based method for real-world makeup transfer that robustly handles a wide range of styles, from light to extremely heavy makeup. Built on a pre-trained diffusion model, it uses a Detail-Preserving (D-P) makeup encoder to encode makeup details, together with content and structural control modules that preserve the content and structural information of the source image. Newly added makeup cross-attention layers in the U-Net transfer detailed makeup to the corresponding positions in the source image, and a content-structure decoupling training strategy keeps the source's content and facial structure intact. The method shows strong robustness and generalizability, extending to tasks such as cross-domain makeup transfer and makeup-guided text-to-image generation.

To address the lack of diversity in existing makeup datasets, the authors propose an automatic data-construction pipeline that employs large language models and generative models to edit real human face images into paired before-and-after makeup images. The resulting dataset comprises 20,000 image pairs covering styles from light to heavy makeup, and has the potential to facilitate makeup-transfer research and the development of more robust and accurate models.
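The summary does not detail the pipeline's internals, so the following is only a minimal sketch of the idea: an off-the-shelf instruction-editing diffusion model (InstructPix2Pix via diffusers) stands in for the paper's generative editor, and the `sample_makeup_instruction` helper and prompt pool are hypothetical stand-ins for the LLM step.

```python
# Sketch of an automatic before/after makeup pair generator.
# Assumptions: InstructPix2Pix as the editor; hypothetical prompt pool
# and helper in place of the paper's LLM-driven instruction generation.
import random

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

MAKEUP_STYLES = [  # hypothetical pool; the paper generates these with an LLM
    "apply light natural makeup with soft pink lipstick",
    "apply heavy smokey-eye makeup with dark red lips",
    "apply dramatic stage makeup with glitter eyeshadow",
]

def sample_makeup_instruction() -> str:
    """Stand-in for the LLM step: pick an editing instruction."""
    return random.choice(MAKEUP_STYLES)

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def make_pair(source_path: str) -> tuple[Image.Image, Image.Image]:
    """Edit a real, makeup-free face photo into its 'after' counterpart."""
    before = Image.open(source_path).convert("RGB").resize((512, 512))
    after = pipe(
        prompt=sample_makeup_instruction(),
        image=before,
        num_inference_steps=30,
        image_guidance_scale=1.5,  # keep identity and structure close to the source
    ).images[0]
    return before, after
```

Running `make_pair` over a corpus of real face photos would yield the kind of paired, style-diverse training data the paper describes, with the instruction pool controlling the light-to-heavy spread of styles.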
Stable-Makeup consists of three key components: the Detail-Preserving Makeup Encoder, the makeup cross-attention layers, and the content and structural control modules. The encoder applies a multi-layer strategy to encode the makeup reference image into multi-scale, spatially aware detail makeup embeddings. The content control module maintains pixel-level content consistency with the source image, while the structural control module injects facial structure, improving structural consistency between the generated image and the source. To semantically align the U-Net's intermediate features of the source image with the detail makeup embeddings, the U-Net is extended with a makeup branch composed of cross-attention layers; the content-structure decoupling training strategy then preserves the source's facial structure.

Evaluated on various datasets, Stable-Makeup achieves state-of-the-art makeup-transfer performance and proves effective in applications including cross-domain makeup transfer, makeup-guided text-to-image generation, and video makeup transfer, with practical implications for cosmetics, entertainment, and fashion.
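As a rough illustration of the two architectural ideas, here is a sketch assuming a CLIP-ViT backbone for the D-P encoder; the layer indices, dimensions, and module names are illustrative guesses, not the authors' exact configuration.

```python
# Sketch of (1) a multi-layer, spatially-aware makeup encoder and
# (2) a makeup cross-attention branch over U-Net features.
# Assumptions: CLIP ViT-L/14 backbone; layer choice and dims are illustrative.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class DetailPreservingEncoder(nn.Module):
    """Encode the makeup reference into multi-scale, spatially aware embeddings
    by gathering patch tokens from several CLIP-ViT layers."""
    def __init__(self, layers=(4, 8, 12, 16, 20, 24), dim=768):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        self.layers = layers
        self.proj = nn.Linear(self.clip.config.hidden_size, dim)

    def forward(self, pixel_values):
        hidden = self.clip(pixel_values, output_hidden_states=True).hidden_states
        # Concatenate patch tokens (dropping the [CLS] token) from chosen layers.
        feats = torch.cat([hidden[i][:, 1:, :] for i in self.layers], dim=1)
        return self.proj(feats)  # (B, num_layers * num_patches, dim)

class MakeupCrossAttention(nn.Module):
    """Extra cross-attention branch: U-Net features query the makeup embeddings,
    so makeup detail lands at the matching spatial positions."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, unet_feats, makeup_embeds):
        out, _ = self.attn(self.norm(unet_feats), makeup_embeds, makeup_embeds)
        return unet_feats + out  # residual add alongside the frozen U-Net path
```

Keeping per-patch tokens from multiple layers, rather than a single pooled vector, is what would let the cross-attention place makeup detail position-by-position, matching the paper's emphasis on spatial-aware embeddings.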