MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

20 Jun 2024 | Kim Sung-Bin*, Lee Chae-Yeon*, Gihun Son*, Oh Hyun-Bin¹, Janghoon Ju³, Suekyeong Nam³, Tae-Hyun Oh¹,²,⁴
This paper introduces a new task: generating 3D talking heads from multilingual speech, addressing the challenge of producing accurate lip sync across diverse languages. The authors present MultiTalk, a multilingual 2D video dataset containing over 420 hours of talking videos in 20 languages, each paired with pseudo 3D mesh data. On top of it they build a multilingually enhanced model, also named MultiTalk, which incorporates language-specific style embeddings to capture the mouth movements unique to each language; trained on the new dataset, it improves multilingual performance over existing models.

The model is trained in two stages: the first learns a discrete codebook of facial motions, and the second trains a temporal autoregressive model that synthesizes 3D faces from the input speech and a language embedding.

For evaluation, the authors propose a new metric, Audio-Visual Lip Readability (AVLR), which assesses lip-sync accuracy using a pre-trained Audio-Visual Speech Recognition (AVSR) model. The model is evaluated both quantitatively, with Lip Vertex Error (LVE) and AVLR, and through a user study in which participants choose between two 3D face videos based on lip synchronization and realism. Experiments show that MultiTalk performs favorably across diverse languages compared to previous work, and users prefer it for both realism and lip synchronization.
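To make the two-stage pipeline concrete, here is a minimal PyTorch sketch. All module and dimension choices (codebook size, flattened mesh dimensionality, a GRU decoder, wav2vec-style audio features) are illustrative assumptions, not the authors' implementation; the key idea shown is that stage one quantizes facial motion into discrete codes, and stage two predicts those codes autoregressively from speech conditioned on a per-language style embedding.

```python
import torch
import torch.nn as nn

class MotionCodebook(nn.Module):
    """Stage 1 (sketch): encode per-frame facial motion and snap it to a
    learned discrete codebook, VQ-style. Commitment loss and the
    straight-through estimator are omitted for brevity."""
    def __init__(self, num_codes=256, code_dim=64, mesh_dim=5023 * 3):
        super().__init__()
        self.encoder = nn.Linear(mesh_dim, code_dim)
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.decoder = nn.Linear(code_dim, mesh_dim)

    def forward(self, motion):                  # motion: (T, mesh_dim)
        z = self.encoder(motion)                # (T, code_dim)
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)  # nearest code
        return self.decoder(self.codebook(idx)), idx

class TalkingHeadAR(nn.Module):
    """Stage 2 (sketch): autoregressively predict motion-code indices from
    speech features, conditioned on a language-specific style embedding."""
    def __init__(self, num_codes=256, num_langs=20, d_model=128, audio_dim=768):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.lang_style = nn.Embedding(num_langs, d_model)  # one style per language
        self.code_embed = nn.Embedding(num_codes, d_model)
        self.rnn = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, num_codes)

    def forward(self, audio_feats, lang_id, prev_codes):
        # audio_feats: (B, T, audio_dim), lang_id: (B,), prev_codes: (B, T)
        style = self.lang_style(lang_id).unsqueeze(1)       # broadcast over time
        x = torch.cat([self.audio_proj(audio_feats) + style,
                       self.code_embed(prev_codes)], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                                 # logits over motion codes
```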
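The LVE metric above is commonly defined in prior 3D talking-head work as the maximal L2 error over the lip vertices in each frame, averaged over the sequence. A minimal sketch, assuming predicted and ground-truth vertex sequences and a known list of lip-vertex indices:

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """pred, gt: (T, V, 3) vertex sequences; lip_idx: lip-vertex indices.
    Per frame, take the largest L2 error over lip vertices, then average."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]        # (T, L, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)      # L2 error per lip vertex
    return per_vertex.max(axis=1).mean()            # max over lips, mean over frames
```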
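AVLR, as described, scores lip readability by transcribing the rendered face with a pre-trained AVSR model and measuring the error against the reference text. Below is a hedged sketch of such an evaluation loop; the AVSR system is left as an assumed callable, and the paper's exact model and protocol may differ.

```python
def word_error_rate(ref, hyp):
    """Word-level edit distance between reference and hypothesis, normalized
    by reference length (standard WER)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def avlr_score(samples, avsr_transcribe):
    """samples: iterable of (rendered_video, audio, reference_text).
    avsr_transcribe: a pre-trained AVSR model wrapped as (video, audio) -> text;
    this callable is an assumption standing in for whatever system is used."""
    errors = [word_error_rate(ref, avsr_transcribe(video, audio))
              for video, audio, ref in samples]
    return sum(errors) / len(errors)   # lower = more lip-readable
```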
Ablation studies validate the design choices, showing that both the language-specific style embeddings and a multilingual speech encoder improve performance. The paper concludes that the proposed dataset and model significantly enhance the multilingual capabilities of 3D talking head generation, and notes that the dataset has broader applications beyond the immediate task, such as audio-visual speech recognition and 2D talking heads. The work underscores the importance of diverse datasets for multilingual 3D talking head generation.