DreamIdentity is a method for efficient face-identity-preserved image generation. It aims to enhance the editability of pre-trained text-to-image (T2I) models so that they preserve a given face identity while following text prompts. The key idea is to learn edit-friendly and accurate face-identity representations in the word embedding space. To make the projected embeddings edit-friendly, the method proposes self-augmented editability learning, which constructs training pairs of generated celebrity faces and corresponding edited celebrity images. In addition, a dedicated face-identity encoder learns an accurate representation of human faces: it extracts multi-scale ID-aware features and feeds them to a multi-embedding projector that produces pseudo words directly in the text embedding space. Extensive experiments show that the method generates more text-coherent and ID-preserved images with negligible time overhead over the standard T2I generation process.
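As a rough illustration of the pairing step, the T2I model can be prompted with a celebrity name to obtain a clean source face and, with edit phrases appended, the corresponding edited targets. The sketch below is a minimal, hypothetical version using Hugging Face diffusers with Stable Diffusion as a stand-in for the foundation model; the model id, prompt templates, and edit list are our assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of self-augmented data construction: the T2I model itself
# synthesizes a celebrity face plus edited images of the same celebrity,
# yielding (source face, edit prompt, edited target) training triples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

celebrities = ["Albert Einstein"]      # identities the T2I model already knows
edits = ["wearing a red hat", "as an oil painting", "in the snow"]

triples = []
for name in celebrities:
    source = pipe(f"a close-up photo of the face of {name}").images[0]
    for edit in edits:
        target = pipe(f"a photo of {name} {edit}").images[0]
        # At training time the encoder embeds `source`, and the prompt uses
        # the pseudo word S* in place of the celebrity name.
        triples.append((source, f"a photo of S* {edit}", target))
```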
In more detail, DreamIdentity is an encoder-based approach that learns edit-friendly and accurate representations in the projected word embedding space, enabling efficient face-identity-preserved generation with high editability. Self-augmented editability learning brings the editing task into the training phase: it exploits the T2I model itself to construct a self-augmented dataset by generating celebrity faces along with a wide range of target-edited images of the same celebrities. For identity encoding, a dedicated Multi-word Multi-scale ID encoder, named the $ M^{2} $ ID encoder, is designed: a ViT-based network pre-trained on a large-scale face dataset, from whose multi-scale, coarse-to-fine features multi-word embeddings are projected. The $ M^{2} $ ID encoder is then trained on a combination of the self-augmented dataset and a conventional face-only dataset to learn the edit-friendly and accurate word embedding $ S^{*} $.
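To make the multi-embedding projection concrete, the following PyTorch sketch shows one plausible reading: each tapped ViT scale gets its own linear head that emits one or more word embeddings, and their concatenation forms the pseudo words $ S^{*} $. The dimensions, the number of tapped scales, and one word per scale are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a multi-scale, multi-embedding projector (not the
# authors' code): per-scale ViT features -> pseudo-word embeddings S*.
import torch
import torch.nn as nn


class MultiEmbeddingProjector(nn.Module):
    def __init__(self, feat_dim: int = 768, text_dim: int = 768,
                 num_scales: int = 3, words_per_scale: int = 1):
        super().__init__()
        self.words_per_scale = words_per_scale
        self.text_dim = text_dim
        # one linear head per tapped ViT scale (coarse to fine)
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, text_dim * words_per_scale)
            for _ in range(num_scales)
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats[i]: (B, feat_dim) feature tapped from one ViT layer
        words = [
            head(f).view(-1, self.words_per_scale, self.text_dim)
            for head, f in zip(self.heads, feats)
        ]
        # (B, num_scales * words_per_scale, text_dim): the pseudo words S*
        return torch.cat(words, dim=1)


# Toy usage: features from three ViT layers for a batch of two face crops.
projector = MultiEmbeddingProjector()
s_star = projector([torch.randn(2, 768) for _ in range(3)])
print(s_star.shape)  # torch.Size([2, 3, 768])
```

The projected embeddings can then replace a placeholder token's embedding in the frozen text encoder's input sequence, which lets the same pipeline serve both the face-only reconstruction data and the self-augmented editing pairs.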
The main contributions of the work are: (1) Conceptually, we point out that current encoder-based methods fall short of high editability because their word-embedding representations are reconstruction-biased and inaccurate. (2) Technically, for an edit-friendly representation, we introduce self-augmented editability learning, which uses the foundation T2I model itself to generate a high-quality editing dataset; for an accurate representation, we propose the dedicated $ M^{2} $ ID encoder with multi-scale features and multi-embedding projection. (3) Experimentally, extensive results demonstrate the superiority of our method, which efficiently achieves flexible text-guided generation while preserving high ID similarity.
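For concreteness, the training objective with such pseudo words is presumably the standard latent-diffusion denoising loss, with the text condition carrying $ S^{*} $ in place of a subject name (the notation below is ours, and the paper may add auxiliary terms):

$$
\mathcal{L} \;=\; \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\Big[\, \big\| \epsilon - \epsilon_{\theta}\big(z_{t},\, t,\, c(y, S^{*})\big) \big\|_{2}^{2} \,\Big],
$$

where $ z_{t} $ is the noised latent of the target image at timestep $ t $, $ \epsilon_{\theta} $ is the denoising network, and $ c(y, S^{*}) $ is the text encoding of prompt $ y $ with $ S^{*} $ inserted; under this reading, the T2I model stays frozen and only the $ M^{2} $ ID encoder receives gradients.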