EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions


12 Jul 2024 | Zhiyuan Chen* (Ant Group, juzhen.czy@antgroup.com), Jiajiong Cao* (Ant Group, jiajiong.caojiajio@antgroup.com), Zhiquan Chen (Ant Group, zhiquan.zhiquanche@antgroup.com), Yuming Li (Ant Group, luoque.lym@antgroup.com), Chenguang Ma (Ant Group, chenguang.mcg@antgroup.com)
**Project Page:** <https://badtobest.github.io/echomimic.html>

**Abstract:** The paper introduces EchoMimic, a novel approach for generating lifelike, dynamic portrait animations from audio and facial landmarks. Traditional methods are either audio-driven or landmark-driven and suffer from limitations such as instability or unnatural results. EchoMimic addresses these issues by training concurrently on audio signals and facial landmarks, and it can generate portrait videos from audio alone, landmarks alone, or a combination of both. Comprehensive evaluations across multiple datasets show superior performance on both quantitative and qualitative metrics.

**Introduction:** Portrait animation has advanced significantly with diffusion models, which enable the creation of hyper-realistic images and videos. However, synchronizing lip movements, facial expressions, and head poses with audio input remains challenging. EchoMimic aims to overcome these challenges by integrating audio and landmark information.

**Related Works:** Diffusion models have shown remarkable versatility in multimedia tasks, including image and video generation. Previous portrait-animation works such as Wav2Lip and AniPortrait generate realistic animations, but they typically rely on either audio or pose inputs alone.

**Method:** EchoMimic builds on the Stable Diffusion framework, incorporating a Denoising U-Net with specialized modules for reference images, landmarks, and audio inputs. A two-stage training strategy first learns the image-audio and image-pose relationships, then integrates temporal dynamics for video generation.
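As a rough illustration of this dual-conditioning design, the PyTorch-style sketch below shows how a denoising U-Net could accept both an audio condition and a landmark condition, with random modality dropout during training so that inference can run on audio alone, landmarks alone, or both. This is an assumption-driven sketch, not the authors' released code: `DualConditionWrapper`, the `audio_enc` and `lmk_enc` modules, and the U-Net call signature are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not EchoMimic's actual implementation) of a
# denoising U-Net wrapper that takes audio tokens via cross-attention and a
# landmark map as a spatial condition, with per-modality dropout in training.
import torch
import torch.nn as nn

class DualConditionWrapper(nn.Module):
    def __init__(self, unet: nn.Module, audio_enc: nn.Module, lmk_enc: nn.Module,
                 p_drop_audio: float = 0.3, p_drop_lmk: float = 0.3):
        super().__init__()
        self.unet, self.audio_enc, self.lmk_enc = unet, audio_enc, lmk_enc
        self.p_drop_audio, self.p_drop_lmk = p_drop_audio, p_drop_lmk

    def forward(self, noisy_latents, timesteps, audio_feat, landmark_map):
        # Encode each modality; during training, zero one out with some
        # probability so the U-Net learns to denoise under any subset of
        # conditions (audio only, landmarks only, or both).
        audio_tokens = self.audio_enc(audio_feat)    # (B, T_a, D) cross-attention tokens
        lmk_latent = self.lmk_enc(landmark_map)      # same spatial shape as the latents
        if self.training and torch.rand(()) < self.p_drop_audio:
            audio_tokens = torch.zeros_like(audio_tokens)
        if self.training and torch.rand(()) < self.p_drop_lmk:
            lmk_latent = torch.zeros_like(lmk_latent)
        # Spatial condition is added to the noisy latents; audio enters through
        # the U-Net's cross-attention layers (hypothetical signature).
        return self.unet(noisy_latents + lmk_latent, timesteps,
                         encoder_hidden_states=audio_tokens)
```

Dropping a modality at training time is one common way to obtain the "audio only, landmarks only, or both" inference modes the abstract describes; the exact mechanism used by EchoMimic may differ.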
**Experiments:** Quantitative evaluations on HDTF, CelebV-HQ, and a collected dataset show that EchoMimic outperforms existing methods in visual quality, temporal coherence, and lip synchronization. Qualitative results demonstrate the adaptability and robustness of the approach across different portrait styles and audio inputs.

**Ablation Study:** Ablation studies validate the effectiveness of the proposed motion-synchronization method and of controlling facial expressions through selected landmark subsets; a minimal sketch of building such a subset follows the Conclusions.

**Limitations and Future Work:** While EchoMimic shows promising results, future work could focus on updating the video-processing framework and accelerating generation for real-time applications.

**Conclusions:** EchoMimic is an effective approach for generating high-quality portrait animations that addresses key challenges in the field and shows significant potential for advancing multimedia experiences.
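To make the idea of "selected landmarks" concrete, here is a minimal sketch that keeps only mouth-region landmarks and rasterizes them into a conditioning frame. It assumes MediaPipe Face Mesh as the landmark detector (the summary does not specify which detector EchoMimic uses), and the function name `mouth_landmark_map` is illustrative.

```python
# Minimal sketch (not the authors' code) of an "editable" landmark condition:
# detect full-face landmarks, keep only a chosen subset (mouth region here),
# and draw them onto a blank frame a landmark-conditioned generator could use.
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

def mouth_landmark_map(bgr_image: np.ndarray) -> np.ndarray:
    """Return an image-sized map containing only mouth landmarks."""
    h, w = bgr_image.shape[:2]
    with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    if not result.multi_face_landmarks:
        return canvas  # no face detected: return an empty condition map
    landmarks = result.multi_face_landmarks[0].landmark
    # FACEMESH_LIPS is a set of (start, end) index pairs; collect the indices.
    lip_indices = {i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair}
    for i in sorted(lip_indices):
        x, y = int(landmarks[i].x * w), int(landmarks[i].y * h)
        cv2.circle(canvas, (x, y), radius=2, color=(0, 255, 0), thickness=-1)
    return canvas
```

Swapping the kept indices (eyes, brows, face oval, or the full set) changes which facial regions the landmark condition controls, which is the sense in which the landmark conditions are editable.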