2 Apr 2024 | Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan
The paper introduces EDTalk, a framework for efficient disentanglement in talking head synthesis that enables independent control over mouth shape, head pose, and emotional expression. The core idea is to decompose facial dynamics into a distinct latent space for each component (mouth, pose, expression), with each space spanned by a set of learnable orthogonal bases stored in a dedicated bank. Enforcing orthogonality both within and across the banks keeps the spaces independent, so manipulating one component does not interfere with the others. The training strategy is efficient and requires no external or prior structures (e.g., pretrained landmark or 3DMM estimators), which substantially reduces training time and computational cost. On top of the disentangled spaces, an Audio-to-Motion module produces audio-driven talking head videos with probabilistic head poses and semantically aware expressions. Experiments show that EDTalk outperforms state-of-the-art methods in both quantitative and qualitative evaluations, with gains in video quality, audio-visual synchronization, and emotional accuracy, while requiring less training data and compute than existing approaches.
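To make the "orthogonal base bank" idea concrete, below is a minimal PyTorch sketch of how per-component banks and weight heads could compose a disentangled motion latent. All names and shapes (BaseBank, DisentangledMotion, num_bases, latent_dim, the softmax weighting) are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of orthogonal base banks for disentangled facial motion.
import torch
import torch.nn as nn


class BaseBank(nn.Module):
    """A bank of learnable bases spanning one latent space (mouth, pose, or expression)."""

    def __init__(self, num_bases: int = 20, latent_dim: int = 512):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, latent_dim))

    def orthogonal_bases(self) -> torch.Tensor:
        # Orthonormalize the stored bases via QR so bases within a bank
        # do not interfere with one another (assumes latent_dim >= num_bases).
        q, _ = torch.linalg.qr(self.bases.t())  # (latent_dim, num_bases)
        return q.t()                            # (num_bases, latent_dim)

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # Weighted combination of bases: (B, num_bases) @ (num_bases, D) -> (B, D)
        return weights @ self.orthogonal_bases()


class DisentangledMotion(nn.Module):
    """Predicts per-component base weights and composes the full motion latent."""

    COMPONENTS = ("mouth", "pose", "expression")

    def __init__(self, feat_dim: int = 512, num_bases: int = 20, latent_dim: int = 512):
        super().__init__()
        self.banks = nn.ModuleDict(
            {name: BaseBank(num_bases, latent_dim) for name in self.COMPONENTS}
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, num_bases) for name in self.COMPONENTS}
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Each component contributes an independent latent; their sum is the
        # facial motion code that would be passed to a generator.
        codes = [
            self.banks[name](torch.softmax(self.heads[name](feat), dim=-1))
            for name in self.COMPONENTS
        ]
        return sum(codes)


# Usage: driving features (e.g., from audio or video) -> composed motion latent.
feat = torch.randn(2, 512)
motion_latent = DisentangledMotion()(feat)
print(motion_latent.shape)  # torch.Size([2, 512])
```

Because each component's latent lives in its own orthogonal subspace, swapping the weights of one bank (say, pose) in principle leaves the mouth and expression contributions untouched, which is the disentanglement property the paper targets.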