July 27-August 1, 2024 | You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo
X-Portrait is a conditional diffusion model for generating expressive and temporally coherent portrait animations. Given a single reference portrait, X-Portrait synthesizes an animation by transferring motion from a driving video, capturing both dynamic and subtle facial expressions as well as wide-range head movements. The model uses a pre-trained diffusion model as the rendering backbone and introduces novel control signals within the ControlNet framework to achieve fine-grained control of head pose and expression. Unlike conventional explicit controls, the motion control module interprets dynamics directly from the driving video. A patch-based local control module strengthens motion attention to small-scale nuances such as eyeball positions. To mitigate identity leakage, the motion control modules are trained with scaling-augmented cross-identity images. Experimental results show X-Portrait's effectiveness across diverse facial portraits and expressive driving sequences, demonstrating superior visual fidelity, identity resemblance, and motion accuracy. X-Portrait's contributions include a novel zero-shot portrait animation method, an implicit motion control scheme, enhanced interpretation of subtle facial expressions, and portrait animation results that require no fine-tuning. The method uses a latent diffusion model with disentangled control of appearance and motion, and incorporates a cross-identity training scheme so that motion can be derived directly from the driving frames. An auxiliary ControlNet guides conditional motion attention to local facial movements. X-Portrait preserves the identity characteristics and background content of the source while accurately following the head poses and expressions of the driving frames. The model is trained on a large dataset of diverse expressions and speech, and evaluated on in-the-wild portraits and test videos. X-Portrait outperforms state-of-the-art portrait animation baselines in terms of image quality, motion accuracy, and identity preservation.
The method achieves high perceptual quality, motion richness, identity preservation, and domain generalization.
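
The summary states that the motion control modules are trained with scaling-augmented cross-identity images, but it does not say how the scaling is applied. The sketch below shows one plausible form of such an augmentation: the driving image is rescaled about its centre by a random factor while the output size stays fixed, so that facial proportions in the driving image are less reliable as an identity cue. The function name `random_scale_augment`, the scale range, and the nearest-neighbour sampling are illustrative assumptions, not details taken from X-Portrait.

```python
import numpy as np


def random_scale_augment(image, rng, min_scale=0.8, max_scale=1.2):
    """Rescale `image` about its centre by a random factor, keeping its size.

    Hypothetical helper: the scale range and nearest-neighbour sampling are
    assumptions for illustration, not values from the X-Portrait paper.
    """
    height, width = image.shape[:2]
    scale = rng.uniform(min_scale, max_scale)

    # Inverse mapping: for each output pixel, find the source coordinate.
    cy, cx = (height - 1) / 2, (width - 1) / 2
    ys = np.rint((np.arange(height) - cy) / scale + cy).astype(int)
    xs = np.rint((np.arange(width) - cx) / scale + cx).astype(int)

    # Output pixels whose source coordinate lies outside the image.
    valid = ((ys >= 0) & (ys < height))[:, None] & ((xs >= 0) & (xs < width))[None, :]

    # Sample with clipped indices, then zero-fill the out-of-range pixels.
    ys = np.clip(ys, 0, height - 1)
    xs = np.clip(xs, 0, width - 1)
    out = image[ys[:, None], xs[None, :]]  # advanced indexing returns a copy
    out[~valid] = 0
    return out, scale
```

In a cross-identity training loop, a helper of this kind would be applied to the driving image only, leaving the reference portrait untouched. A scale below 1 shrinks the face and pads the border with zeros; a scale above 1 zooms in and crops.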