**Follow-Your-Emoji** is a diffusion-based framework for fine-controllable and expressive portrait animation. The main challenge in portrait animation is to preserve the identity of the reference portrait while transferring the target expressions and maintaining temporal consistency and fidelity. To address these challenges, the framework introduces two key components: an expression-aware landmark and a facial fine-grained loss.
1. **Expression-Aware Landmark**: This landmark, derived from 3D keypoints extracted with MediaPipe, aligns the driving motion with the reference portrait, preventing identity leakage and enhancing the portrayal of exaggerated expressions (see the first sketch after this list).
2. **Facial Fine-Grained Loss**: This loss uses facial masks and expression masks to make the model focus on subtle expression changes and detailed appearance reconstruction (see the second sketch after this list).
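The expression-aware landmark additionally retargets the driver's 3D keypoints to the reference face before rasterization; that retargeting step is specific to the paper and not reproduced here. As a minimal sketch of the underlying signal, the snippet below extracts normalized 3D keypoints with MediaPipe Face Mesh and rasterizes them into a landmark control image; the helper names are hypothetical.

```python
import cv2
import mediapipe as mp
import numpy as np

def extract_3d_keypoints(rgb_image):
    """Return (478, 3) normalized 3D face keypoints from an RGB image, or None."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         refine_landmarks=True) as face_mesh:
        result = face_mesh.process(rgb_image)
    if not result.multi_face_landmarks:
        return None
    landmarks = result.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in landmarks])

def draw_landmark_image(keypoints, height, width, radius=1):
    """Rasterize normalized keypoints into an image used as a control signal."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for x, y, _ in keypoints:
        cv2.circle(canvas, (int(x * width), int(y * height)), radius,
                   (255, 255, 255), -1)
    return canvas
```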
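For the facial fine-grained loss, a minimal sketch is shown below, assuming the facial and expression masks simply reweight a standard reconstruction (e.g. noise-prediction) objective; the weighting `lambda_expr` and the function name are assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def facial_fine_grained_loss(pred, target, face_mask, expr_mask, lambda_expr=1.0):
    """Mask-weighted MSE: the face mask emphasizes the whole face region, while the
    expression mask emphasizes expression-related areas such as the eyes and mouth.
    Masks must be broadcastable to the prediction tensor."""
    face_term = F.mse_loss(pred * face_mask, target * face_mask)
    expr_term = F.mse_loss(pred * expr_mask, target * expr_mask)
    return face_term + lambda_expr * expr_term
```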
The method shows strong performance in controlling expressions across diverse portrait styles, including real humans, cartoons, sculptures, and animals. A progressive generation strategy extends the model to stable long-term animation. To address the lack of a suitable benchmark, the authors introduce EmojiBench, a comprehensive benchmark of diverse portrait images, driving videos, and landmarks.
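The summary does not spell out the progressive strategy, but a common way to realize stable long-term generation is to produce overlapping windows and condition each window on the tail of the previous one. The sketch below assumes a hypothetical `model.generate(reference, landmarks, motion_context=...)` API and illustrates only the windowing logic.

```python
def progressive_animate(model, reference, landmark_seq, window=16, overlap=4):
    """Generate a long animation in overlapping windows so clips stay temporally
    consistent; each window sees the last `overlap` frames of the previous one."""
    frames, context = [], None
    step = window - overlap
    for start in range(0, len(landmark_seq), step):
        chunk = landmark_seq[start:start + window]
        out = model.generate(reference, chunk, motion_context=context)  # assumed API
        frames.extend(out if start == 0 else out[overlap:])  # skip frames already produced
        context = out[-overlap:]
    return frames
```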
**Contributions**:
- Introduction of Follow-Your-Emoji, a diffusion-based framework for fine-controllable portrait animation.
- Proposal of expression-aware landmarks and a facial fine-grained loss to enhance the model's performance.
- Construction of a high-quality expression training dataset and the EmojiBench benchmark.
- Comprehensive evaluation showing superior performance in handling diverse portraits and motions.
**Related Work**:
- GAN-based methods often produce unrealistic content and artifacts due to their limited generative capacity and inaccurate motion representations.
- Diffusion models, particularly Stable Diffusion, have shown better generation ability but struggle with identity preservation and precise expression control.
**Experiments**:
- The method is trained on HDTF, VFHQ, and a new dataset with 18 exaggerated expressions and 20-minute real-human videos.
- EmojiBench, a comprehensive benchmark, is introduced to evaluate the model's performance.
- Quantitative and qualitative comparisons with state-of-the-art methods demonstrate superior performance in identity preservation, expression generation, and temporal consistency.
**Conclusion**:
Follow-Your-Emoji effectively addresses the challenges of portrait animation by incorporating expression-aware landmarks and a facial fine-grained loss, achieving high-quality and expressive results.