17 May 2024 | Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang
CapHuman is a framework for human-centric image synthesis that generates photo-realistic, diverse portraits of a specific individual from a single reference facial photograph. It preserves identity in a generalizable way while offering fine-grained head control, covering head position, pose, facial expression, and illumination across different contexts.

Built on the pre-trained text-to-image diffusion model Stable Diffusion, CapHuman adopts an "encode then learn to align" paradigm: identity features are encoded from the reference photograph and learned to be aligned with the generative model, so identity is preserved for new individuals without cumbersome per-subject tuning at inference. A 3D facial prior additionally provides flexible, 3D-consistent head control.

To evaluate identity preservation, text-to-image alignment, and head-control precision, the authors introduce a new benchmark, HumanIPHC. On this benchmark, CapHuman achieves strong qualitative and quantitative results against established baselines, generating well-identity-preserved, photo-realistic, high-fidelity portraits with content-rich representations and varied head renditions, and it outperforms the baselines on all three evaluation axes. The framework also adapts to other pre-trained models, enabling flexible and diverse image generation. The authors position CapHuman as the first framework to preserve individual identity while supporting both text and head control in human-centric image synthesis, and validate it through extensive experiments, including qualitative and quantitative analyses and user studies.
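The "encode then learn to align" paradigm can be illustrated with a minimal sketch: a face encoder produces a fixed-size identity embedding, and a learned projection aligns it to the diffusion model's text-conditioning space, where it is appended to the text token sequence. The encoder stub, dimensions, and projection below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
ID_DIM, CTX_DIM, N_TOKENS = 512, 768, 77

rng = np.random.default_rng(0)

def encode_identity(face_image):
    # Stand-in for a frozen, pretrained face encoder:
    # maps a reference photo to a fixed-size identity embedding.
    return rng.standard_normal(ID_DIM)

# "Learn to align": a trainable projection that maps the identity
# embedding into the diffusion model's text-conditioning space.
W_align = rng.standard_normal((ID_DIM, CTX_DIM)) * 0.02

def build_condition(face_image, text_tokens):
    id_emb = encode_identity(face_image)   # (512,)
    id_ctx = id_emb @ W_align              # (768,), aligned to text space
    # Append the aligned identity token to the text context sequence,
    # which then conditions the denoising network via cross-attention.
    return np.vstack([text_tokens, id_ctx[None, :]])

text_tokens = rng.standard_normal((N_TOKENS, CTX_DIM))
cond = build_condition(None, text_tokens)
print(cond.shape)  # (78, 768)
```

Because only the alignment projection (and related modules) is trained while the face encoder and diffusion backbone stay frozen, a new individual's identity can be injected at inference from a single photo, with no per-subject fine-tuning.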