EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

2 Apr 2024 | Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan
EDTalk is a novel framework for efficient disentanglement in emotional talking head synthesis, enabling precise control over mouth shape, head pose, and emotional expression. The framework decomposes facial dynamics into three distinct latent spaces: mouth, pose, and expression, each represented by a set of learnable bases. The bases are constrained to be orthogonal, which keeps the spaces independent and makes training efficient. An Audio-to-Motion module supports audio-driven synthesis by predicting the weights for each latent space directly from audio features.

The method achieves complete disentanglement through an efficient training strategy that handles audio and video inputs simultaneously. Experiments show that EDTalk outperforms existing methods in both quantitative and qualitative evaluations, producing high-quality, realistic talking heads with fine-grained control while requiring less training time and fewer computational resources than competing methods. It also enables generating talking heads from a single audio input, without relying on explicit image or video references for motion. Evaluations on multiple datasets demonstrate superior video quality, audio-visual synchronization, and emotional accuracy, highlighting the effectiveness of EDTalk in achieving disentangled, precise control over diverse facial motions.
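To make the bases-plus-weights idea concrete, here is a minimal PyTorch sketch of how three orthogonality-constrained basis banks and an audio-to-weight predictor could fit together. The module names, dimensions, and the Gram-matrix orthogonality penalty are assumptions for illustration, not the authors' implementation or the official EDTalk code.

```python
# Hypothetical sketch of the bases-plus-weights design described above.
# Dimensions, names, and the orthogonality penalty are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentBank(nn.Module):
    """A bank of learnable basis vectors for one facial component (mouth, pose, or expression)."""
    def __init__(self, num_bases: int = 20, dim: int = 512):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim) * 0.01)

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (batch, num_bases) -> latent displacement (batch, dim)
        return weights @ self.bases

def orthogonality_loss(*banks: ComponentBank) -> torch.Tensor:
    """Penalize deviation of the stacked, normalized bases from an identity Gram matrix,
    encouraging orthogonality within and across component banks."""
    all_bases = torch.cat([F.normalize(b.bases, dim=-1) for b in banks], dim=0)
    gram = all_bases @ all_bases.t()
    identity = torch.eye(gram.size(0), device=gram.device)
    return ((gram - identity) ** 2).mean()

class AudioToMotion(nn.Module):
    """Maps an audio feature to per-component basis weights (one head per latent space)."""
    def __init__(self, audio_dim: int = 256, num_bases: int = 20):
        super().__init__()
        self.mouth_head = nn.Linear(audio_dim, num_bases)
        self.expr_head = nn.Linear(audio_dim, num_bases)
        self.pose_head = nn.Linear(audio_dim, num_bases)

    def forward(self, audio_feat: torch.Tensor):
        return (self.mouth_head(audio_feat),
                self.expr_head(audio_feat),
                self.pose_head(audio_feat))

# Usage: combine the three component displacements into a single motion latent.
mouth_bank, pose_bank, expr_bank = ComponentBank(), ComponentBank(), ComponentBank()
a2m = AudioToMotion()
audio_feat = torch.randn(4, 256)                      # a batch of audio features
w_mouth, w_expr, w_pose = a2m(audio_feat)
motion = mouth_bank(w_mouth) + pose_bank(w_pose) + expr_bank(w_expr)
ortho = orthogonality_loss(mouth_bank, pose_bank, expr_bank)
```

In this sketch, each latent space contributes a displacement spanned only by its own bases, so driving one set of weights (e.g., the mouth) leaves the others untouched; the orthogonality term is what keeps the spaces from leaking into one another during training.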