Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

30 Jan 2024 | Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
The paper "Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance" addresses the challenge of generating realistic and expressive 3D facial animations from speech. The authors introduce a trilogy of methods: Generalized Neural Parametric Facial Asset (GNPFA), the M2F-D dataset, and Media2Face, a diffusion model for co-speech facial animation generation. 1. **Generalized Neural Parametric Facial Asset (GNPFA)**: GNPFA is a variational auto-encoder that maps facial geometry and images to a latent space, decoupling expressions and identities. It is trained on a large dataset of 4D facial scans, including high-resolution images and refined face geometries, to capture nuanced facial expressions and head poses. 2. **M2F-D Dataset**: This dataset is created by extracting high-quality facial expressions and head poses from diverse videos using GNPFA. It includes a wide range of emotions, styles, and languages, providing a rich source of annotated data for training. 3. **Media2Face**: A diffusion model trained in the latent space of GNPFA, capable of generating high-fidelity lip-syncing and nuanced facial animations. It integrates rich multi-modal inputs (audio, text, and image) to control facial expressions and head poses, achieving both high realism and flexibility in style adaptation. The paper demonstrates the effectiveness of Media2Face through extensive experiments and user studies, showing superior performance in lip synchronization, expression stylization, and head movement synchronization compared to existing methods. The system also supports various applications, such as generating realistic facial animations from diverse audio sources and editing animations based on text and image prompts.The paper "Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance" addresses the challenge of generating realistic and expressive 3D facial animations from speech. The authors introduce a trilogy of methods: Generalized Neural Parametric Facial Asset (GNPFA), the M2F-D dataset, and Media2Face, a diffusion model for co-speech facial animation generation. 1. **Generalized Neural Parametric Facial Asset (GNPFA)**: GNPFA is a variational auto-encoder that maps facial geometry and images to a latent space, decoupling expressions and identities. It is trained on a large dataset of 4D facial scans, including high-resolution images and refined face geometries, to capture nuanced facial expressions and head poses. 2. **M2F-D Dataset**: This dataset is created by extracting high-quality facial expressions and head poses from diverse videos using GNPFA. It includes a wide range of emotions, styles, and languages, providing a rich source of annotated data for training. 3. **Media2Face**: A diffusion model trained in the latent space of GNPFA, capable of generating high-fidelity lip-syncing and nuanced facial animations. It integrates rich multi-modal inputs (audio, text, and image) to control facial expressions and head poses, achieving both high realism and flexibility in style adaptation. The paper demonstrates the effectiveness of Media2Face through extensive experiments and user studies, showing superior performance in lip synchronization, expression stylization, and head movement synchronization compared to existing methods. The system also supports various applications, such as generating realistic facial animations from diverse audio sources and editing animations based on text and image prompts.