1 Jun 2024 | Shenhao Zhu*¹, Junming Leo Chen*², Zuozhuo Dai³, Qingkun Su³, Yinghui Xu², Xun Cao¹, Yao Yao¹, Hao Zhu†¹, and Siyu Zhu†²
This paper introduces a novel methodology for human image animation that integrates a 3D parametric human model (SMPL) within a latent diffusion framework. The approach enhances shape alignment and motion guidance, improving the quality and realism of generated animations. Key contributions include:
1. **3D Parametric Model**: Utilizes the SMPL (Skinned Multi-Person Linear) model to establish a unified representation of body shape and pose, capturing intricate geometric and motion characteristics from source videos.
2. **Motion Guidance**: Incorporates depth images, normal maps, and semantic maps rendered from SMPL sequences, along with skeleton-based motion guidance, to enrich the latent diffusion model with comprehensive 3D shape and detailed pose attributes (a projection sketch follows this list).
3. **Multi-Layer Motion Fusion**: A multi-layer motion fusion module applies self-attention to fuse shape and motion latent representations, improving the model's ability to generate accurate and temporally consistent animations.
4. **Experimental Validation**: Demonstrates superior performance on benchmark datasets (TikTok and UBC fashion video datasets) and a novel in-the-wild dataset, showing enhanced generalization capabilities.
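To make the rendered-guidance idea concrete, the sketch below projects camera-space SMPL vertices through a pinhole intrinsic matrix and splats them into a depth map with a z-buffer. This is a rough stand-in, not the paper's renderer: a real pipeline would rasterize the full SMPL mesh (e.g., with pyrender or nvdiffrast), and the function and parameter names here are illustrative assumptions.

```python
# Minimal sketch: vertex-splat depth rendering from SMPL geometry,
# assuming a pinhole camera and camera-space vertices with z > 0.
# A proper implementation would rasterize mesh faces, not splat points.
import numpy as np

def splat_depth(vertices: np.ndarray, K: np.ndarray, hw=(256, 256)) -> np.ndarray:
    """vertices: (N, 3) camera-space points; K: (3, 3) camera intrinsics."""
    h, w = hw
    depth = np.full((h, w), np.inf)
    uvz = (K @ vertices.T).T                  # perspective projection
    uv = (uvz[:, :2] / uvz[:, 2:3]).round().astype(int)
    z = vertices[:, 2]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    for (u, v), d in zip(uv[inside], z[inside]):
        depth[v, u] = min(depth[v, u], d)     # z-buffer: keep nearest surface
    depth[np.isinf(depth)] = 0.0              # background pixels
    return depth

# Example with a random point cloud standing in for SMPL's 6890 vertices.
K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
verts = np.random.randn(6890, 3) * 0.3 + np.array([0.0, 0.0, 2.5])
print(splat_depth(verts, K).shape)  # (256, 256)
```

Normal and semantic maps would be produced the same way, carrying per-vertex normals or body-part labels through the projection instead of depth values.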
The methodology is structured around three components:
1. **SMPL Model Integration**: Projects SMPL sequences into image space to generate depth, normal, and semantic maps.
2. **Skeleton-Based Motion Guidance**: Adds skeleton signals to guide intricate motions, such as facial expressions and finger articulation, with greater precision.
3. **Multi-Layer Feature Embedding**: Uses self-attention to integrate multi-layer feature embeddings as conditioning for a latent video diffusion model, leading to precise image animation (a fusion sketch follows this list).
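The fusion step can be pictured as plain self-attention over tokens drawn from each guidance latent. The module below is a minimal sketch under that assumption; the class name, tensor shapes, and the choice to average over the condition axis are illustrative, not the paper's implementation.

```python
# Minimal sketch of multi-condition fusion: guidance latents (depth, normal,
# semantic, skeleton) are flattened into tokens and fused with self-attention.
# Names and shapes are assumptions, not the paper's actual module.
import torch
import torch.nn as nn

class MotionFusion(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, guidance_latents: list[torch.Tensor]) -> torch.Tensor:
        # Each latent: (B, C, H, W). Turn every condition into H*W tokens.
        b, c, h, w = guidance_latents[0].shape
        tokens = torch.stack(
            [g.flatten(2).transpose(1, 2) for g in guidance_latents], dim=1
        ).reshape(b, -1, c)                       # (B, num_conditions*H*W, C)
        q = self.norm(tokens)
        fused, _ = self.attn(q, q, q)             # self-attention across all tokens
        fused = tokens + fused                    # residual connection
        # Average over the condition axis to recover one fused spatial map.
        fused = fused.reshape(b, len(guidance_latents), h * w, c).mean(dim=1)
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Example: fuse depth, normal, semantic, and skeleton latents of size 16x16.
latents = [torch.randn(2, 320, 16, 16) for _ in range(4)]
fusion = MotionFusion(channels=320)
print(fusion(latents).shape)  # torch.Size([2, 320, 16, 16])
```

The fused map would then serve as the conditioning signal injected into the video diffusion UNet at the matching resolution.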
The paper also includes a comprehensive evaluation, comparing the proposed approach with state-of-the-art methods, and discusses limitations and future directions.