1 Jun 2024 | Shenhao Zhu*, Junming Leo Chen*, Zuozhuo Dai, Qingkun Su, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu
This paper introduces Champ, a method for human image animation that leverages a 3D parametric human model (SMPL) within a latent diffusion framework to improve shape alignment and motion guidance. SMPL provides a unified, low-dimensional representation of both body shape and pose, which the method uses to encode the 3D geometry of the reference image and to extract human motion from source videos, capturing intricate geometry and motion characteristics.

From the SMPL sequences, the method renders depth images, normal maps, and semantic maps, and pairs them with skeleton-based motion guidance, enriching the conditions supplied to the latent diffusion model with comprehensive 3D shape and detailed pose attributes. A multi-layer motion fusion module with self-attention fuses these shape and motion latent representations in the spatial domain. Because motion is expressed through the parametric model, the method can also perform parametric shape alignment between the reference image and the source-video motion.

The approach is structured around three components, sketched in the code below:
1) Sequences derived from the SMPL model are projected onto the image space to generate depth images, normal maps, and semantic maps that encapsulate essential 3D information.
2) Skeleton-based motion guidance sharpens the guidance signal, particularly for intricate movements.
3) Self-attention helps the feature maps learn representative saliency regions, improving the model's ability to comprehend and generate human postures and shapes.
Conditioning a latent video diffusion model on these multi-layer feature embeddings yields animation that is precise in both pose and shape.
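For concreteness, here is a minimal sketch of driving the SMPL model with shape and pose parameters, using the open-source `smplx` package. The model path and the zero-valued parameters are placeholders, not the paper's settings.

```python
# Minimal sketch: recover a posed body mesh from SMPL parameters with smplx
# (pip install smplx). Expects SMPL .pkl model files under models/smpl/.
import torch
import smplx

smpl = smplx.create("models/", model_type="smpl")

betas = torch.zeros(1, 10)         # shape coefficients (identity / body proportions)
body_pose = torch.zeros(1, 69)     # axis-angle rotations for the 23 body joints
global_orient = torch.zeros(1, 3)  # root orientation

out = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = out.vertices            # (1, 6890, 3) posed mesh vertices
joints = out.joints                # 3D joint locations
```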
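Parametric shape alignment can then be sketched as keeping the reference subject's shape coefficients while replaying the driving video's per-frame poses. The `fit_smpl_to_image` helper below is hypothetical, standing in for an off-the-shelf HMR-style SMPL estimator; the paper's actual fitting pipeline may differ.

```python
# Sketch of parametric shape alignment: shape (betas) comes from the
# reference identity, motion (pose) comes from each driving frame.
def align_motion(smpl, ref_image, driving_frames, fit_smpl_to_image):
    ref = fit_smpl_to_image(ref_image)       # dict with 'betas', 'body_pose', 'global_orient'
    aligned_meshes = []
    for frame in driving_frames:
        drv = fit_smpl_to_image(frame)
        out = smpl(
            betas=ref["betas"],               # shape from the reference image
            body_pose=drv["body_pose"],       # motion from the driving frame
            global_orient=drv["global_orient"],
        )
        aligned_meshes.append(out.vertices)
    return aligned_meshes
```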
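Each aligned mesh is then rendered into guidance maps. A sketch with `trimesh` and `pyrender` follows; the camera intrinsics and the flat-shaded vertex-color trick for the normal map are assumptions, not the paper's exact renderer.

```python
# Sketch: render an SMPL mesh into depth and normal guidance maps.
# A semantic map could analogously color vertices by an SMPL body-part
# segmentation. vertices/faces are numpy arrays (e.g. out.vertices[0].numpy()).
import numpy as np
import trimesh
import pyrender

def render_guidance(vertices, faces, cam_pose, size=512):
    mesh = trimesh.Trimesh(vertices, faces, process=False)
    # Encode per-vertex normals as RGB so the color pass doubles as a normal map.
    mesh.visual.vertex_colors = ((mesh.vertex_normals + 1.0) / 2.0 * 255).astype(np.uint8)

    scene = pyrender.Scene(bg_color=[0, 0, 0, 0])
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    camera = pyrender.IntrinsicsCamera(fx=5000, fy=5000, cx=size / 2, cy=size / 2)
    scene.add(camera, pose=cam_pose)

    renderer = pyrender.OffscreenRenderer(size, size)
    # FLAT skips lighting, so the color buffer holds the raw vertex colors.
    normal_map, depth_map = renderer.render(scene, flags=pyrender.RenderFlags.FLAT)
    renderer.delete()
    return depth_map, normal_map
```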
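The multi-layer motion fusion module can be approximated as below: each guidance stream (depth, normal, semantic, skeleton) gets its own small convolutional encoder, the feature maps are summed, and spatial self-attention over the flattened tokens picks out salient regions. The channel width, single attention layer, and downsampling factor are simplifying assumptions, not the paper's exact architecture.

```python
# Sketch of self-attention fusion over multiple guidance streams.
import torch
import torch.nn as nn

class MotionFusion(nn.Module):
    def __init__(self, n_streams=4, dim=320, heads=8):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.SiLU(),
                          nn.Conv2d(dim, dim, 3, stride=2, padding=1))
            for _ in range(n_streams)
        ])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, streams):                    # list of (B, 3, H, W) maps
        feats = sum(enc(s) for enc, s in zip(self.encoders, streams))
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        q = self.norm(tokens)
        tokens = tokens + self.attn(q, q, q)[0]    # residual self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```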
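Finally, a schematic of how the fused guidance could condition one denoising step of the latent video diffusion model. The injection point, adding the fused feature map to the noisy latent before the UNet, is an assumption modeled on pose-guider-style conditioning, and `unet` here stands for a diffusers-style conditional UNet.

```python
# Sketch of guidance injection into a single denoising step.
def denoise_step(unet, fusion, z_t, t, streams, ref_embedding):
    guidance = fusion(streams)   # (B, C, h, w) fused shape/motion features
    # Assumed injection point: add guidance to the noisy latent before denoising.
    return unet(z_t + guidance, t, encoder_hidden_states=ref_embedding)
```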
Experiments on the TikTok and UBC fashion video datasets show that the method generates high-quality human animations that accurately capture both pose and shape variations, with superior image quality and video fidelity, and a comparison against state-of-the-art approaches on a newly proposed in-the-wild video dataset demonstrates robust generalization to unseen domains. The method also supports cross-identity animation and multi-view animation. Its main limitation is modeling capacity for faces and hands, which the authors mitigate by incorporating DWpose as an additional constraint; efficiency is analyzed in terms of GPU memory requirements and the time consumed by each pipeline step. The method has the potential to advance digital content creation in fields requiring detailed and realistic human representations.