FlashFace: Human Image Personalization with High-fidelity Identity Preservation

25 Mar 2024 | Shilong Zhang¹, Lianghua Huang², Xi Chen¹, Yifei Zhang², Zhi-Fan Wu², Yutong Feng², Wei Wang², Yujun Shen³, Yu Liu², and Ping Luo¹
FlashFace is a practical tool that lets users personalize their own photos on the fly by providing one or a few reference face images and a text prompt. It distinguishes itself from existing human photo customization methods through higher-fidelity identity preservation and better instruction following. Two key design choices contribute to this. First, the face identity is encoded into a series of feature maps rather than a single image token, which allows the model to retain fine details of the reference faces, such as scars, tattoos, and face shape. Second, a disentangled integration strategy balances text and image guidance during text-to-image generation, reducing conflicts between the reference faces and the text prompt, such as personalizing an adult into a "child" or an "elder."
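To make the first design choice concrete, here is a minimal sketch in PyTorch, assuming a generic reference encoder and attention layout; it is not the authors' implementation, and all module names and sizes are illustrative. The point is that the reference faces stay as flattened spatial feature maps that the generator cross-attends to, instead of being pooled into a single identity token.

```python
# Minimal sketch, not the authors' implementation: identity is kept as
# spatial feature maps and injected through an extra attention layer.
import torch
import torch.nn as nn


class ReferenceAttention(nn.Module):
    """Cross-attends generator tokens to flattened reference feature maps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # x:          (B, N, C) flattened U-Net feature tokens
        # ref_tokens: (B, M, C) one token per spatial location of every
        #             reference face, so fine detail is not pooled away
        out, _ = self.attn(self.norm(x), ref_tokens, ref_tokens)
        return x + out  # residual injection into the generator


B, C = 2, 320
unet_tokens = torch.randn(B, 16 * 16, C)
# Feature maps for 4 reference faces (e.g., from a reference encoder),
# kept as 4 * 16 * 16 tokens rather than one pooled identity embedding.
ref_maps = torch.randn(B, 4, C, 16, 16)
ref_tokens = ref_maps.flatten(3).permute(0, 1, 3, 2).reshape(B, -1, C)
fused = ReferenceAttention(C)(unet_tokens, ref_tokens)  # (B, 256, 320)
```

Because the reference features remain spatial, localized attributes such as a scar or tattoo can influence the corresponding region of the generated face, which single-token encodings tend to wash out.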
Extensive experiments demonstrate the effectiveness of FlashFace across applications including human image personalization, face swapping under language prompts, and turning virtual characters into real people. The method preserves facial details with high fidelity and follows text prompts accurately, even when the text conflicts with the reference images.

FlashFace also introduces a novel data construction pipeline that enforces facial variation between the reference face and the generated face, pushing the model to learn from the text prompt rather than directly copying the reference. The data collection pipeline attaches an individual ID annotation to each image, so that multiple images of the same person can be sampled during training (a toy sketch of such ID-grouped sampling appears below).

The framework builds on the widely used SD-V1.5 model, with a U-Net for denoising and a CLIP model for language encoding. A Face ReferenceNet extracts detailed facial features that retain spatial shape, and these are incorporated into the network through additional reference attention layers. The method also exposes a reference-strength control at inference, letting users adjust the balance between the language prompt and the references when they conflict, and it incorporates classifier-free guidance to strengthen both prompt following and facial-detail preservation; hedged sketches of both mechanisms close out this summary.

Experiments show that FlashFace outperforms previous methods in face similarity and identity preservation, demonstrating its effectiveness across downstream tasks. Its ability to maintain identity fidelity while following text prompts opens up new possibilities for human image personalization.
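The ID-annotated data pipeline can be pictured with a toy sampler. The helper below, `sample_id_group`, is a hypothetical illustration rather than the paper's pipeline: it draws the training target and its references from different photos of the same person, so reference and target naturally differ.

```python
# Toy sketch of ID-grouped sampling, an assumption about how a dataset
# with per-image identity annotations could be used during training.
import random
from collections import defaultdict


def sample_id_group(records, num_refs=4):
    """records: iterable of (image_path, person_id) pairs."""
    by_id = defaultdict(list)
    for path, pid in records:
        by_id[pid].append(path)
    # Keep only identities with enough distinct photos.
    eligible = [paths for paths in by_id.values() if len(paths) > num_refs]
    paths = random.choice(eligible)
    picked = random.sample(paths, num_refs + 1)
    # The target is a different photo from every reference, so the model
    # cannot succeed by copying a reference pixel-for-pixel.
    target, refs = picked[0], picked[1:]
    return target, refs
```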
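For the inference-time reference strength, one simple mechanism, offered as an assumption rather than the paper's actual design, is to scale the residual contributed by the reference attention layer (reusing the `ReferenceAttention` module from the earlier sketch):

```python
def reference_attention_with_strength(layer, x, ref_tokens, ref_strength=1.0):
    # `ref_strength` and this residual scaling are illustrative assumptions;
    # the paper may realize strength control differently (e.g., by
    # re-weighting inside the attention computation itself).
    out, _ = layer.attn(layer.norm(x), ref_tokens, ref_tokens)
    return x + ref_strength * out  # 0.0 ignores the references entirely
```

Setting `ref_strength` below 1.0 lets the text prompt dominate when it conflicts with the references; values above 1.0 favor the reference identity.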
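Finally, classifier-free guidance over two conditions (reference and text) is commonly decomposed as shown below. The scales and nesting order follow a standard compositional formulation and are assumptions; the paper's exact guidance scheme may differ.

```python
def two_condition_cfg(eps_uncond, eps_ref, eps_full, s_ref=2.0, s_text=7.5):
    """Combine three denoiser outputs into one guided noise prediction.

    eps_uncond: eps(x_t)              -- neither text nor reference
    eps_ref:    eps(x_t | ref)        -- reference faces only
    eps_full:   eps(x_t | ref, text)  -- reference faces and text prompt
    """
    return (eps_uncond
            + s_ref * (eps_ref - eps_uncond)    # pull toward the identity
            + s_text * (eps_full - eps_ref))    # then toward the prompt
```

Raising `s_ref` relative to `s_text` favors the reference identity, while lowering it favors the prompt, mirroring the text-versus-reference trade-off described above.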