EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

29 Apr 2024 | Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, Maja Pantic
EMOPortraits is a method for creating one-shot neural head avatars that substantially improves the transfer of intense and asymmetric facial expressions, outperforming prior methods in both quantitative metrics and visual quality. It introduces significant changes to the training pipeline and model architecture that address the limitations of the MegaPortraits model in expressing intense facial motion. EMOPortraits also adds a speech-driven mode, so avatars can be animated from audio alone, making the method suitable for applications such as virtual assistants and mixed reality.

The authors additionally propose FEED, a new multi-view video dataset that captures a wide range of intense and asymmetric facial expressions, filling a gap in existing datasets. FEED contains 520 multi-view videos of 23 subjects captured with 3 cameras and covers a broad spectrum of facial behaviour, including strong asymmetric expressions, tongue and cheek movements, winks, head rotations, and eye movements.

The evaluation shows superior performance in both image-driven and speech-driven modes, with notable improvements in FID scores, user preference, and the realism of facial dynamics. The paper also discusses limitations, such as the absence of body or shoulder generation and difficulties with extreme head rotations, highlighting these as directions for future research.
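
The summary cites FID as the headline image-quality metric. For readers unfamiliar with it, the sketch below shows the standard Fréchet Inception Distance computation between feature statistics of real and generated images. This is generic reference code, not the paper's evaluation pipeline; the 64-dimensional toy features and random inputs are illustrative assumptions (real FID uses 2048-dimensional Inception-v3 activations).

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).
    # With Inception-v3 activations as the features, this value is the FID.
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):       # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy usage with random stand-in features (assumed dimensions, for illustration only).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(1000, 64))
fake_feats = rng.normal(loc=0.1, size=(1000, 64))
mu_r, sig_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
mu_f, sig_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
print(f"FID (toy features): {frechet_distance(mu_r, sig_r, mu_f, sig_f):.3f}")

Lower FID indicates that the distribution of generated frames is closer to the distribution of real frames, which is why improvements in this score accompany the reported gains in perceived quality.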