X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

July 27-August 1, 2024 | You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo
X-Portrait is a conditional diffusion model for generating expressive, temporally coherent portrait animations. Given a single reference portrait, it synthesizes an animation by transferring motion from a driving video, capturing both dynamic and subtle facial expressions as well as wide-range head movements.

The method uses a pre-trained latent diffusion model as the rendering backbone, with disentangled control of appearance and motion, and introduces novel control signals within the ControlNet framework for fine-grained head pose and expression control. Unlike conventional explicit controls, the motion control module interprets dynamics directly from the driving video. A patch-based local control module strengthens motion attention to small-scale nuances such as eyeball positions, and an auxiliary ControlNet guides conditional motion attention to local facial movements. To mitigate identity leakage, the motion control modules are trained with scaling-augmented cross-identity images, a cross-identity training scheme that lets the model derive motion directly from the driving frames.

The stated contributions are a zero-shot portrait animation method, an implicit motion control scheme, improved interpretation of subtle facial expressions, and animation results that require no fine-tuning. X-Portrait preserves the identity characteristics and background content of the source while accurately following the head poses and expressions of the driving frames. The model is trained on a large dataset of diverse expressions and speech and evaluated on in-the-wild portraits and test videos covering diverse facial portraits and expressive driving sequences. In these experiments, X-Portrait outperforms state-of-the-art portrait animation baselines in image quality (visual fidelity), motion accuracy, and identity preservation.
The method achieves high perceptual quality, motion richness, identity preservation, and domain generalization.
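
The summary mentions training the motion control modules on scaling-augmented images but gives no details. As a rough illustration only, the sketch below shows one plausible form of scaling augmentation: rescale a driving frame by a random factor, then pad or center-crop it back to its original size so that facial proportions no longer match the source identity exactly. The function name `random_scale_augment`, the scale range, the nearest-neighbour resize, and the zero padding are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def random_scale_augment(image, rng, scale_range=(0.8, 1.2)):
    """Rescale `image` by a random factor, then pad or center-crop to the original size.

    Hypothetical helper for illustration; the scale range is an assumed value.
    """
    height, width = image.shape[:2]
    scale = rng.uniform(*scale_range)
    new_h = max(1, int(round(height * scale)))
    new_w = max(1, int(round(width * scale)))

    # Nearest-neighbour resize via index lookup (keeps the sketch dependency-free).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), height - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), width - 1)
    resized = image[rows][:, cols]

    # Paste the resized image, centered, onto a zero canvas of the original size.
    canvas = np.zeros_like(image)
    src_top = max(0, (new_h - height) // 2)
    src_left = max(0, (new_w - width) // 2)
    dst_top = max(0, (height - new_h) // 2)
    dst_left = max(0, (width - new_w) // 2)
    copy_h = min(height, new_h)
    copy_w = min(width, new_w)
    canvas[dst_top:dst_top + copy_h, dst_left:dst_left + copy_w] = (
        resized[src_top:src_top + copy_h, src_left:src_left + copy_w]
    )
    return canvas, scale


# Usage: a random array stands in for a driving video frame.
rng = np.random.default_rng(0)
driving_frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented, scale = random_scale_augment(driving_frame, rng)
print(augmented.shape, augmented.dtype, round(float(scale), 3))
```

The augmented frame keeps the shape and dtype of the input, so it could stand in for the original driving frame wherever the motion control module expects its conditioning image. A real pipeline would more likely use an interpolating resize from an image library than the index lookup used here.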