2024-01-30 | Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
Media2Face is a system that generates co-speech facial animations from multi-modal guidance. To tackle the challenge of producing realistic yet flexible facial animation from speech, it introduces the Generalized Neural Parametric Facial Asset (GNPFA), a variational auto-encoder that maps facial geometry and images into a latent space that decouples expressions from identities. GNPFA is trained on a wide array of multi-identity 4D facial scans, including high-resolution images and artist-refined face geometries, dubbed Range of Motion (RoM) data. It is then used to extract high-quality expressions and accurate head poses from a large collection of online videos with abundant audio and text labels, avoiding tedious manual annotation and augmenting the limited pool of existing 4D facial animation data. The result is the M2F-D dataset: a large, diverse, scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels, covering a wide range of content, styles, emotions, and languages.
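To make the latent-space idea concrete, here is a minimal PyTorch sketch of a GNPFA-style variational auto-encoder that encodes per-frame geometry into an expression latent while the decoder additionally receives a separate identity code, keeping the two factors decoupled. This is an illustration under assumptions: the vertex count, latent dimensions, layer widths, and the name GNPFAStyleVAE are invented for the example and are not taken from the paper.

```python
# Minimal sketch (not the authors' code) of a GNPFA-style VAE: geometry -> expression
# latent, with a separate identity code fed only to the decoder. All sizes are assumptions.
import torch
import torch.nn as nn

N_VERTS, EXPR_DIM, ID_DIM = 5023, 128, 64  # illustrative dimensions

class GNPFAStyleVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: flattened vertex positions -> expression latent distribution
        self.encoder = nn.Sequential(
            nn.Linear(N_VERTS * 3, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
        )
        self.to_mu = nn.Linear(512, EXPR_DIM)
        self.to_logvar = nn.Linear(512, EXPR_DIM)
        # Decoder: expression latent + identity code -> vertex positions
        self.decoder = nn.Sequential(
            nn.Linear(EXPR_DIM + ID_DIM, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, N_VERTS * 3),
        )

    def encode(self, verts):
        h = self.encoder(verts.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, verts, id_code):
        mu, logvar = self.encode(verts)
        z_expr = self.reparameterize(mu, logvar)
        recon = self.decoder(torch.cat([z_expr, id_code], dim=-1))
        return recon.view(-1, N_VERTS, 3), mu, logvar

# Usage: encode a batch of face meshes and reconstruct them under a given identity code.
vae = GNPFAStyleVAE()
verts = torch.randn(4, N_VERTS, 3)    # stand-in for scanned geometry
id_code = torch.randn(4, ID_DIM)      # stand-in for an identity embedding
recon, mu, logvar = vae(verts, id_code)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, verts) + 1e-4 * kl
```

Because the identity code bypasses the encoder, the same expression latent can in principle be re-decoded onto different identities, which is the property that lets expressions extracted from videos be reused as training data.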
Media2Face itself is a diffusion model operating in the GNPFA latent space. It accepts rich multi-modal guidance from audio, text, and image, integrating these inputs to drive vivid facial animations including head poses, with flexible conditioning and disentangled control over speech, style, and emotion. The system produces high-quality lip-sync with speech and expresses nuanced emotions conveyed by text, images, and even music, while keyframe editing and text/image guidance allow fine-grained, personalized, and stylized control of the output. Extensive experiments on audio-driven animation and on style-based generation from text and image prompts show high fidelity and a broadened scope of expressiveness and style adaptability in 3D facial animation. Compared with several state-of-the-art facial animation methods, Media2Face achieves superior lip accuracy, facial expression stylization, and rhythmic head-movement synthesis, and participants in user studies prefer it over the alternatives.
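The generation stage can be pictured as denoising a sequence of expression-and-pose latents conditioned on audio features and an optional style embedding derived from text or an image. The sketch below is a hypothetical DDPM-style sampling loop with classifier-free guidance on the style condition; the denoiser, noise schedule, sequence length, and feature dimensions are placeholder assumptions rather than the paper's actual design.

```python
# Hypothetical sketch of latent diffusion sampling with multi-modal conditioning.
# The denoiser here is a dummy so the example runs; a real model would be a
# sequence network conditioned on audio and style embeddings.
import torch

T_FRAMES, LATENT_DIM, STEPS = 150, 128 + 6, 50  # assumed: expression latent + head pose

@torch.no_grad()
def sample(denoiser, audio_feat, style_emb, guidance_scale=2.5):
    """DDPM-style ancestral sampling with classifier-free guidance on the style condition."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, T_FRAMES, LATENT_DIM)   # start from Gaussian noise
    null_style = torch.zeros_like(style_emb)   # "unconditional" style token
    for t in reversed(range(STEPS)):
        t_batch = torch.full((1,), t)
        # Predict noise with and without the style condition, then blend (CFG).
        eps_cond = denoiser(x, t_batch, audio_feat, style_emb)
        eps_uncond = denoiser(x, t_batch, audio_feat, null_style)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        # Standard DDPM posterior mean; add noise at every step except the last.
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean if t == 0 else mean + betas[t].sqrt() * torch.randn_like(x)
    return x  # denoised latents, to be decoded to geometry by the GNPFA decoder

# Stand-in denoiser so the sketch executes end to end.
class DummyDenoiser(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(LATENT_DIM, LATENT_DIM)
    def forward(self, x, t, audio_feat, style_emb):
        return self.net(x)

latents = sample(DummyDenoiser(),
                 audio_feat=torch.randn(1, T_FRAMES, 768),  # e.g. frame-aligned speech features
                 style_emb=torch.randn(1, 512))              # e.g. a text- or image-derived embedding
```

The classifier-free guidance pattern shown here is one common way to trade off fidelity to the speech signal against adherence to a text or image style prompt; whether Media2Face uses exactly this mechanism is an assumption of the sketch.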