MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

20 Jun 2024 | Kim Sung-Bin1*, Lee Chae-Yeon2*, Gihun Son1*, Oh Hyun-Bin1, Janghoon Ju3, Suekyeong Nam3, Tae-Hyun Oh1,2,4
This paper introduces the novel task of generating 3D talking heads from multilingual speech, addressing the lack of diverse datasets for multilingual, speech-driven 3D talking head generation. The authors collect the MultiTalk dataset, comprising over 420 hours of 2D talking videos in 20 languages, paired with pseudo 3D meshes and transcripts. They propose a two-stage training process for the MultiTalk model: first, a vector-quantized autoencoder (VQ-VAE) learns a discrete codebook of context-rich facial motions; then, a temporal autoregressive model is trained to synthesize 3D faces conditioned on the input speech and a language embedding, allowing the model to capture the mouth movements characteristic of each language. The authors also introduce a new evaluation metric, Audio-Visual Lip Readability (AVLR), to assess lip-sync accuracy in multilingual settings. Experiments show that the MultiTalk model substantially improves multilingual performance over existing models trained on English-only datasets. The paper includes quantitative and qualitative comparisons, user studies, and ablation studies to validate the effectiveness of the proposed approach.
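The two-stage design described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the module names, layer sizes, motion dimensionality, and quantization details are assumptions, and only the overall structure (a VQ-VAE motion codebook followed by an autoregressive token predictor conditioned on speech features and a language embedding) follows the paper's description.

```python
# Illustrative sketch of the two-stage MultiTalk pipeline (assumed architecture, not
# the paper's exact implementation).
import torch
import torch.nn as nn


class MotionVQVAE(nn.Module):
    """Stage 1: learn a discrete codebook over per-frame facial-motion vectors."""

    def __init__(self, motion_dim=15069, hidden=256, codebook_size=512):
        # motion_dim=15069 assumes FLAME-style 5023 vertices x 3 offsets (illustrative).
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.codebook = nn.Embedding(codebook_size, hidden)
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, motion_dim))

    def quantize(self, z):
        # Nearest-neighbour codebook lookup with a straight-through gradient estimator.
        B, T, H = z.shape
        dist = torch.cdist(z.reshape(-1, H), self.codebook.weight)  # (B*T, K)
        idx = dist.argmin(dim=-1).view(B, T)                        # (B, T) token ids
        z_q = self.codebook(idx)                                    # (B, T, H)
        return z + (z_q - z).detach(), idx

    def forward(self, motion):                                      # motion: (B, T, motion_dim)
        z = self.encoder(motion)
        z_q, idx = self.quantize(z)
        return self.decoder(z_q), z, z_q, idx                       # trained with recon + commitment losses


class SpeechToMotionTokens(nn.Module):
    """Stage 2: autoregressively predict codebook tokens from audio + language embedding."""

    def __init__(self, audio_dim=768, hidden=256, codebook_size=512, num_languages=20):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)               # e.g. wav2vec-style features
        self.lang_emb = nn.Embedding(num_languages, hidden)          # one embedding per language
        self.token_emb = nn.Embedding(codebook_size + 1, hidden)     # +1 for a BOS token
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, codebook_size)

    def forward(self, audio_feats, lang_id, prev_tokens):
        # audio_feats: (B, T_a, audio_dim); lang_id: (B,); prev_tokens: (B, T) shifted-right ids.
        memory = self.audio_proj(audio_feats) + self.lang_emb(lang_id)[:, None, :]
        tgt = self.token_emb(prev_tokens)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf'), device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                                        # logits over codebook ids
```

At inference time, predicted token ids would be decoded back into 3D facial motion through the frozen VQ-VAE decoder; swapping the language id while keeping the same audio is what lets the model reflect language-specific articulation.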