EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

29 Apr 2024 | Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, Maja Pantic
This paper introduces EMOPortraits, a novel method for creating neural head avatars with superior performance in image-driven, cross-identity emotion translation. The model faithfully reproduces intense, asymmetric facial expressions, achieving state-of-the-art results in emotion transfer. It also incorporates a speech-driven mode, enabling audio-driven facial animation and allowing the source identity to be driven through diverse modalities: visual signals, audio, or a blend of both.

The research builds on the MegaPortraits model, which has demonstrated state-of-the-art results in cross-driving synthesis. However, the original model is limited in its ability to express intense facial motions. To address these limitations, the authors propose substantial changes to both the training pipeline and the model architecture, introducing EMOPortraits. They also propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions, filling a gap left by the absence of such data in existing datasets.

The paper presents a comprehensive analysis of the MegaPortraits model, revealing that while it is of limited effectiveness at representing intense motion, it has significant potential for improvement through targeted architectural modifications, adjustments to the training approach, and the integration of the novel dataset. The authors also integrate speech driving into their model, achieving top-tier performance in audio-driven facial animation. They propose a novel loss function that helps achieve the desired results and generates plausible head rotations and blinks, enhancing the model's applicability across various tasks.

The paper also presents the FEED dataset, a unique multi-view dataset that spans a broad spectrum of extreme facial expressions, addressing the scientific community's demand for high-quality multi-view facial-expression videos beyond the standard categories.

The authors conducted extensive experiments to evaluate their model's performance, comparing it with other models in both image-driven and speech-driven modes. The results show that their model outperforms others in terms of FID scores and user preference for facial expression translation. The paper concludes that while the model has some limitations, such as not generating the avatar's body or shoulders and sometimes struggling with accurate expression translation, these challenges are central to future enhancements and remain a focus of the authors' ongoing research.
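The multimodal driving described above can be pictured as a single inference call that takes one source portrait plus any combination of driver video frames and audio. The sketch below is a hypothetical, simplified interface for that idea: the names `DrivingSignal` and `animate_one_shot`, the 25 fps / 16 kHz assumptions, and the dummy generator are all illustrative and not the authors' code; it only shows how the three driving modes (visual, audio, blended) reduce to one entry point.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class DrivingSignal:
    """Hypothetical container for the driving inputs of a one-shot avatar.

    Either field may be None: frames-only gives image-driven animation,
    audio-only gives speech-driven animation, and both together blend
    the two modalities.
    """
    frames: Optional[Sequence[np.ndarray]] = None  # driver video frames, each (H, W, 3)
    audio: Optional[np.ndarray] = None             # mono waveform samples


def animate_one_shot(source_image: np.ndarray,
                     driving: DrivingSignal) -> list[np.ndarray]:
    """Placeholder for a one-shot head-avatar pipeline in the spirit of EMOPortraits.

    A real model would extract identity/appearance features from the single
    source image and motion/expression features from the driving signal, then
    decode animated frames. Here we only validate the input combination and
    return dummy frames so the interface is runnable.
    """
    if driving.frames is None and driving.audio is None:
        raise ValueError("Provide driver frames, audio, or both.")

    # Output length follows the driver video if present; otherwise derive it
    # from the audio duration (assuming 25 fps output and 16 kHz audio).
    if driving.frames is not None:
        n_frames = len(driving.frames)
    else:
        n_frames = max(1, int(len(driving.audio) / 16_000 * 25))

    # Stand-in for the generator: repeat the source image unchanged.
    return [source_image.copy() for _ in range(n_frames)]


# Usage: drive a 512x512 source portrait with one second of (silent) audio.
source = np.zeros((512, 512, 3), dtype=np.uint8)
out = animate_one_shot(source, DrivingSignal(audio=np.zeros(16_000)))
print(len(out))  # ~25 placeholder frames
```

The point of the single entry point is that the caller does not need separate APIs for image-driven and speech-driven animation; supplying both inputs corresponds to the blended mode described in the paper.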