25 Jun 2024 | Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang
ID-Animator is a zero-shot human video generation method that produces personalized videos from a single facial image without per-identity fine-tuning. It attaches a lightweight face adapter to a pre-trained text-to-video diffusion model; the adapter encodes identity-relevant embeddings from facial latent queries and injects them into the otherwise frozen backbone. To train the adapter, the authors introduce an ID-oriented dataset construction pipeline that produces unified human-attribute and action captions together with a facial image pool, and a random reference training strategy with an ID-preserving loss, which helps the model capture identity-relevant features rather than incidental details of the reference image and improves both fidelity and generalization for ID-specific video generation.

Because the backbone stays frozen, ID-Animator is compatible with popular pre-trained T2V models such as AnimateDiff, with various community backbone models, and with ControlNet, making it easy to extend to real-world applications where identity preservation is desired and enabling highly customized video generation. It outperforms previous models at personalized human video generation, with superior identity fidelity and generalization: it can recontextualize elements of the reference image, including hair, clothing, background, actions, age, and gender, and it can blend distinct identities or create identity-specific videos.
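To make the adapter mechanism concrete, here is a minimal PyTorch-style sketch of one plausible design: a small projection maps a face embedding to a handful of identity tokens, and a decoupled cross-attention branch (in the spirit of IP-Adapter) attends to those tokens alongside the frozen text cross-attention. The module names, dimensions, and the decoupled-attention layout are illustrative assumptions, not ID-Animator's actual implementation.

```python
# Hypothetical sketch of a lightweight face adapter conditioning a frozen T2V UNet.
# Names and dimensions are illustrative, not taken from the ID-Animator codebase.
import torch
import torch.nn as nn


class FaceAdapter(nn.Module):
    """Projects a face embedding into a small set of identity tokens."""

    def __init__(self, face_dim=512, cross_dim=768, num_queries=16):
        super().__init__()
        self.num_queries, self.cross_dim = num_queries, cross_dim
        self.proj = nn.Linear(face_dim, cross_dim * num_queries)
        self.norm = nn.LayerNorm(cross_dim)

    def forward(self, face_emb):                       # (B, face_dim)
        tokens = self.proj(face_emb).view(-1, self.num_queries, self.cross_dim)
        return self.norm(tokens)                       # (B, num_queries, cross_dim)


class DecoupledCrossAttention(nn.Module):
    """Frozen text cross-attention plus an extra branch over identity tokens,
    blended with a tunable identity scale."""

    def __init__(self, dim=320, cross_dim=768, heads=8, id_scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=cross_dim,
                                               vdim=cross_dim, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, heads, kdim=cross_dim,
                                             vdim=cross_dim, batch_first=True)
        self.id_scale = id_scale

    def forward(self, hidden, text_tokens, id_tokens):
        text_out, _ = self.text_attn(hidden, text_tokens, text_tokens)
        id_out, _ = self.id_attn(hidden, id_tokens, id_tokens)
        return hidden + text_out + self.id_scale * id_out


# Toy usage: condition one spatial attention block on text + identity tokens.
adapter, block = FaceAdapter(), DecoupledCrossAttention()
face_emb = torch.randn(2, 512)           # embedding of the reference face
text_tokens = torch.randn(2, 77, 768)    # frozen text-encoder output
hidden = torch.randn(2, 64, 320)         # flattened latent features of one frame
out = block(hidden, text_tokens, adapter(face_emb))
print(out.shape)                          # torch.Size([2, 64, 320])
```

In a design like this, only the adapter and the identity branch would be trained, which is how the backbone stays untouched and remains compatible with community checkpoints.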
Quantitatively, the method is evaluated with CLIP-I, the DOVER score, motion score, and dynamic degree, and outperforms existing methods; tests across a variety of prompts further show high-quality, identity-preserving generation. The random reference strategy effectively suppresses ID-irrelevant features in the reference image, which raises the identity fidelity of the generated videos. Training is lightweight, completing on a single A100 GPU within a day, and inference is zero-shot; the approach works with various T2V generation backbones, and code and checkpoints have been released.
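For the identity metric, CLIP-I is commonly computed as the mean cosine similarity between CLIP image embeddings of the reference image and the generated frames. The snippet below is a hedged sketch using the public ViT-B/32 CLIP checkpoint from Hugging Face; the paper's exact backbone, face cropping, and frame sampling are assumptions here.

```python
# Hypothetical CLIP-I style score: mean cosine similarity between CLIP image
# embeddings of the reference image and each generated frame.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_i(reference: Image.Image, frames: list[Image.Image]) -> float:
    """Average cosine similarity between the reference and generated frames."""
    inputs = processor(images=[reference] + frames, return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize
    ref, gen = feats[:1], feats[1:]
    return (ref @ gen.T).mean().item()
```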