6 May 2024 | Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu
AniTalker is a framework that turns a single static portrait and an input audio clip into a talking video with naturally flowing movement. At its core is a universal motion representation that captures a wide range of facial dynamics, from subtle expressions to head movements.

AniTalker learns this representation with two self-supervised strategies. First, it reconstructs target video frames from source frames of the same identity, so the motion encoder learns to encode subtle frame-to-frame motion from image transformations alone, which greatly reduces the need for labeled data. Second, it trains an identity encoder with metric learning, using the identity labels already present in the dataset to jointly optimize an identity recognition network, while actively minimizing the mutual information between the identity and motion encoders. Together, these strategies keep the motion representation dynamic and free of identity-specific detail, so the motion encoder captures the intricacies of facial dynamics rather than who is moving.
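To make the two self-supervised strategies concrete, here is a minimal PyTorch sketch of how such a disentanglement objective could be wired together. The encoder and decoder architectures, the loss weights, and especially the mutual-information term (approximated below by a simple cross-correlation penalty rather than a learned MI estimator) are illustrative assumptions for this example, not the authors' implementation.

```python
# Hedged sketch of AniTalker-style motion/identity disentanglement training.
# Architectures and hyperparameters are illustrative placeholders, not the
# authors' code; the mutual-information term is approximated here by a simple
# cross-correlation penalty instead of a learned MI estimator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Tiny conv encoder mapping a 3x64x64 image to a d-dim embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, dim),
        )

    def forward(self, x):
        return self.net(x)


class Decoder(nn.Module):
    """Reconstructs the target frame from (identity, motion) embeddings."""
    def __init__(self, id_dim: int, motion_dim: int):
        super().__init__()
        self.fc = nn.Linear(id_dim + motion_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, id_emb, motion_emb):
        h = self.fc(torch.cat([id_emb, motion_emb], dim=-1))
        return self.net(h.view(-1, 128, 8, 8))


def decorrelation_penalty(id_emb, motion_emb):
    """Stand-in for mutual-information minimization: penalize cross-correlation
    between the identity and motion embedding spaces."""
    a = (id_emb - id_emb.mean(0)) / (id_emb.std(0) + 1e-6)
    b = (motion_emb - motion_emb.mean(0)) / (motion_emb.std(0) + 1e-6)
    cross = (a.T @ b) / a.shape[0]
    return (cross ** 2).mean()


def training_step(id_enc, motion_enc, dec, src, tgt, neg):
    """src/tgt: frames of the same identity; neg: frames of other identities."""
    id_emb = id_enc(src)
    motion_emb = motion_enc(tgt)

    # 1) Self-supervised reconstruction: identity from source, motion from target.
    recon = dec(id_emb, motion_emb)
    loss_recon = F.l1_loss(recon, tgt)

    # 2) Metric learning on the identity encoder (triplet: source frame as anchor,
    #    target frame as positive, a frame of another identity as negative).
    loss_id = F.triplet_margin_loss(id_emb, id_enc(tgt), id_enc(neg), margin=0.2)

    # 3) Decorrelate identity and motion embeddings (MI-minimization proxy).
    loss_mi = decorrelation_penalty(id_emb, motion_emb)

    return loss_recon + loss_id + 0.1 * loss_mi


if __name__ == "__main__":
    id_enc, motion_enc = Encoder(128), Encoder(32)
    dec = Decoder(128, 32)
    src, tgt, neg = (torch.rand(4, 3, 64, 64) for _ in range(3))
    loss = training_step(id_enc, motion_enc, dec, src, tgt, neg)
    loss.backward()
    print(f"toy loss: {loss.item():.4f}")
```

The key design point the sketch tries to reflect is that the decoder is only given identity information from the source frame and motion information from the target frame, so any identity detail leaking into the motion embedding is discouraged by both the reconstruction setup and the decorrelation term.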
On top of this motion space, AniTalker pairs a diffusion-based motion generator with a variance adapter, enabling diverse and controllable facial animation driven by audio. Extensive evaluations show that the framework produces detailed, realistic facial movements while preserving identity, and that it generalizes to other images with facial structure, such as cartoons, sculptures, reliefs, and game characters, underscoring its scalability. Because identity and motion are fully decoupled, the model captures the intrinsic nature of facial movement, which further strengthens this generalization. AniTalker thus sets a strong benchmark for the realistic and dynamic representation of digital human faces, with broad applications in entertainment, communication, and education. Remaining challenges include inconsistencies in complex backgrounds and edge blurring in extreme cases; future work will focus on improving the temporal coherence and rendering quality of the rendering module.
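As a closing illustration, the sketch below shows one plausible way to combine a diffusion-based motion generator with a variance adapter, as described above. The denoiser architecture, the noise schedule, and the choice of a per-frame scalar attribute are assumptions made for the example; AniTalker's actual generation module may differ in structure and conditioning.

```python
# Hedged sketch of a diffusion-based motion generator with a variance adapter,
# in the spirit of AniTalker's generation stage. Shapes, the attribute choice and
# the denoiser architecture are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VarianceAdapter(nn.Module):
    """Predicts a scalar attribute per frame (e.g., head-pose amplitude, a
    hypothetical choice here) and folds it back into the conditioning, so the
    attribute can also be overridden at inference for explicit control."""
    def __init__(self, dim: int):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.embed = nn.Linear(1, dim)

    def forward(self, cond, attr=None):
        pred = self.predictor(cond)            # (B, T, 1)
        attr = pred if attr is None else attr  # use supplied value when given
        return cond + self.embed(attr), pred


class Denoiser(nn.Module):
    """Transformer that predicts the noise added to a motion-latent sequence,
    conditioned on audio features and the diffusion timestep."""
    def __init__(self, motion_dim=32, cond_dim=64, width=128):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + cond_dim, width)
        self.t_embed = nn.Embedding(1000, width)
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(width, motion_dim)

    def forward(self, noisy_motion, cond, t):
        h = self.in_proj(torch.cat([noisy_motion, cond], dim=-1))
        h = h + self.t_embed(t)[:, None, :]    # broadcast timestep over frames
        return self.out_proj(self.backbone(h))


def diffusion_training_step(denoiser, adapter, motion, audio_cond, attr_target):
    """Standard epsilon-prediction DDPM loss plus a variance-adapter loss."""
    B, T, _ = motion.shape
    betas = torch.linspace(1e-4, 0.02, 1000)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, 1000, (B,))
    noise = torch.randn_like(motion)
    ab = alpha_bar[t].view(B, 1, 1)
    noisy = ab.sqrt() * motion + (1 - ab).sqrt() * noise

    cond, attr_pred = adapter(audio_cond, attr=attr_target)  # teacher-force attribute
    eps = denoiser(noisy, cond, t)

    return F.mse_loss(eps, noise) + F.mse_loss(attr_pred, attr_target)


if __name__ == "__main__":
    denoiser, adapter = Denoiser(), VarianceAdapter(64)
    motion = torch.randn(2, 50, 32)       # 50 frames of 32-dim motion latents
    audio_cond = torch.randn(2, 50, 64)   # aligned audio features
    attr_target = torch.randn(2, 50, 1)   # per-frame attribute signal
    loss = diffusion_training_step(denoiser, adapter, motion, audio_cond, attr_target)
    loss.backward()
    print(f"toy loss: {loss.item():.4f}")
```

At inference time, the `attr` argument of the adapter would be left as `None` to use the predicted value, or supplied by the user to steer the corresponding attribute, which is where the "controllable" part of the generation comes from in this sketch.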