AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation


26 Mar 2024 | Huawei Wei, Zejun Yang*, and Zhisheng Wang
AniPortrait is a novel framework for generating high-quality portrait animations driven by audio and a reference image. The framework consists of two stages. First, transformer-based models extract a 3D facial mesh and head pose from the audio input, which are then projected into a sequence of 2D facial landmarks. Second, a robust diffusion model combined with a motion module converts the landmark sequence into a temporally consistent, photorealistic portrait animation.

The method achieves high facial naturalness, pose diversity, and visual quality, offering an enhanced perceptual experience. It is also flexible and controllable, making it suitable for applications such as facial motion editing and face reenactment. Training follows a two-step approach, with the 2D component and the motion module trained separately on large-scale facial video datasets, leveraging the strong generalization of diffusion models to produce realistic animations.

The framework uses wav2vec2.0 for audio feature extraction and SD1.5 as the backbone of the diffusion model. A limitation is the reliance on intermediate 3D representations, which can be costly to obtain; future work aims to generate portrait videos directly from audio to further improve results. Evaluations on several benchmarks demonstrate the method's effectiveness in generating realistic and expressive portrait animations.
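To make the first stage more concrete, below is a minimal sketch (not the authors' code) of how an audio-to-landmark module in this spirit could be structured: a pretrained wav2vec2.0 encoder, a small transformer over the audio features, and heads that predict per-frame 3D mesh and head pose, which are then projected to 2D landmarks. The model name, layer sizes, and the projection helper are illustrative assumptions.

```python
# Sketch of an Audio2Landmarks stage in the spirit of AniPortrait's first stage.
# Assumptions: wav2vec2-base checkpoint, 468-vertex mesh, simple orthographic projection.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Audio2Landmarks(nn.Module):
    def __init__(self, n_mesh_verts: int = 468, hidden: int = 768):
        super().__init__()
        # Pretrained speech encoder; its features drive facial motion.
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # Heads for the two intermediate 3D representations described in the paper.
        self.mesh_head = nn.Linear(hidden, n_mesh_verts * 3)  # 3D facial mesh vertices
        self.pose_head = nn.Linear(hidden, 6)                  # head rotation + translation

    def forward(self, input_values: torch.Tensor):
        feats = self.audio_encoder(input_values).last_hidden_state  # (B, T, hidden)
        feats = self.temporal(feats)
        mesh = self.mesh_head(feats)   # (B, T, n_mesh_verts * 3)
        pose = self.pose_head(feats)   # (B, T, 6)
        return mesh, pose

def project_to_2d(mesh: torch.Tensor, image_size: int = 512) -> torch.Tensor:
    """Hypothetical projection of the predicted 3D mesh to 2D landmark coordinates.
    The real pipeline would apply the predicted head pose and a camera model;
    here depth is simply dropped as a stand-in."""
    B, T, _ = mesh.shape
    verts = mesh.view(B, T, -1, 3)
    return verts[..., :2] * image_size  # (B, T, n_mesh_verts, 2)
```

The resulting 2D landmark sequence would then serve as the conditioning signal for the second stage, where a diffusion backbone (SD1.5 in the paper) with a motion module renders the final temporally consistent video frames.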