AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding


6 May 2024 | Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu
AniTalker is a framework that generates animated talking faces from a single portrait and input audio, producing videos with natural movements. Unlike existing models that focus on verbal cues such as lip synchronization, AniTalker uses a universal motion representation to capture complex facial dynamics, including subtle expressions and head movements. It employs self-supervised learning strategies to enhance motion depiction: reconstructing target video frames from source frames and minimizing the mutual information between the identity and motion encoders. This keeps the motion representation dynamic and identity-free while reducing the need for labeled data.

For generation, the framework pairs a diffusion-based motion generator with a variance adapter, enabling diverse and controllable facial animations. Its universal motion encoder decouples identity from motion information, supporting the creation of realistic and dynamic avatars, and the self-supervised training yields robust motion representations without relying on labeled data.
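The identity–motion decoupling described above can be illustrated with a short sketch. The following PyTorch code (module names, sizes, and the CLUB-style mutual-information estimator are assumptions for illustration, not the authors' implementation) reconstructs a target frame from source-frame identity and target-frame motion while penalizing an upper bound on the mutual information between the two embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLUBEstimator(nn.Module):
    """Variational upper bound on I(identity; motion), used here as a penalty."""
    def __init__(self, dim_id, dim_motion, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim_id, hidden), nn.ReLU(),
                                nn.Linear(hidden, dim_motion))
        self.logvar = nn.Sequential(nn.Linear(dim_id, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim_motion))

    def loglik(self, z_id, z_motion):
        # Gaussian log-likelihood of the motion code given the identity code (constants dropped)
        mu, logvar = self.mu(z_id), self.logvar(z_id)
        return (-(z_motion - mu) ** 2 / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, z_id, z_motion):
        # paired likelihood minus likelihood under shuffled (marginal) pairs
        perm = torch.randperm(z_motion.size(0), device=z_motion.device)
        return self.loglik(z_id, z_motion) - self.loglik(z_id, z_motion[perm])

def train_step(id_enc, motion_enc, decoder, club, src, tgt, lambda_mi=0.1):
    """One self-supervised step: identity from the source frame, motion from the target frame."""
    z_id = id_enc(src)                # identity embedding (hypothetical encoder)
    z_motion = motion_enc(tgt)        # motion embedding (hypothetical encoder)
    recon = decoder(z_id, z_motion)   # reconstruct the target frame
    return F.l1_loss(recon, tgt) + lambda_mi * club.mi_upper_bound(z_id, z_motion)
```

In a full training loop the estimator itself would be updated in a separate step to maximize `loglik` on paired embeddings, so the bound stays tight while the encoders learn to minimize it.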
The motion encoder is trained with a combination of metric learning and mutual information disentanglement to keep the motion representation identity-free, and a hierarchical aggregation layer (HAL) helps it capture motion variance across different scales.

Performance is evaluated on three datasets, where AniTalker shows superior results in image structural metrics and face similarity and demonstrates strong generalization across different identities and media. The framework sets a new benchmark for realistic and dynamic representation of digital human faces, with potential applications in entertainment, communication, and education. Limitations include blurring in complex backgrounds and at extreme face angles; future work will focus on improving temporal coherence and rendering effects.
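As a concrete illustration of the hierarchical aggregation layer mentioned above, here is a minimal sketch (class name, shapes, and the softmax-weighted pooling scheme are assumptions, not the paper's exact design): features from several encoder stages are projected to a common dimension, pooled to a shared spatial size, and mixed with learned per-scale weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAggregation(nn.Module):
    """Mixes multi-scale encoder features so coarse and fine motion cues both contribute."""
    def __init__(self, in_channels, out_dim):
        # in_channels: list of channel counts, one per encoder stage
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_dim, kernel_size=1) for c in in_channels)
        self.scale_logits = nn.Parameter(torch.zeros(len(in_channels)))  # learned per-scale weights

    def forward(self, feats):
        # feats: list of tensors [B, C_i, H_i, W_i] from different encoder stages
        target_hw = feats[-1].shape[-2:]  # pool every scale to the coarsest spatial grid
        pooled = [F.adaptive_avg_pool2d(self.proj[i](f), target_hw) for i, f in enumerate(feats)]
        weights = torch.softmax(self.scale_logits, dim=0)
        return sum(w * p for w, p in zip(weights, pooled))  # [B, out_dim, H, W]
```

The softmax over `scale_logits` lets training decide how much each scale contributes, matching the summary's point that motion variance is captured across different scales.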