Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

25 Mar 2024 | Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee
The paper introduces "Make-Your-Anchor," a diffusion-based 2D avatar generation framework for creating realistic, high-quality anchor-style human videos. The system requires only a one-minute video clip of an individual for training and can then automatically generate videos with precise torso and hand movements. The key contributions are:

1. **Frame-wise Motion-to-Appearance Diffusing**: A structure-guided diffusion model (SGDM) binds movements to a specific appearance, using a two-stage training strategy that first strengthens general motion generation and then fine-tunes the model on a single identity (see the first sketch after this list).
2. **Batch-overlapped Temporal Denoising**: An all-frame cross-attention module produces temporally consistent videos without additional training, countering the frame-to-frame randomness of diffusion model outputs (illustrated in the second sketch below).
3. **Identity-Specific Face Enhancement**: An inpainting-based enhancement module improves the visual quality of facial regions in the output videos (sketched last below).

The system is evaluated on a dataset of ten identities and demonstrates superior visual quality, temporal coherence, and identity preservation compared to state-of-the-art methods. The paper also includes ablation studies validating each component and discusses limitations and future work.
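
To make the first contribution concrete, here is a minimal PyTorch sketch of frame-wise structure conditioning: a rendered pose frame is encoded and concatenated with the noisy latent before denoising, in the spirit of ControlNet-style guidance. All class and variable names (`StructureEncoder`, `StructureGuidedDenoiser`) are hypothetical; the paper's actual SGDM architecture is not reproduced here.

```python
# Hypothetical sketch of frame-wise motion-to-appearance conditioning.
# A pose rendering of each frame is encoded and fused with the noisy latent,
# so the denoiser learns to bind motion (pose) to one identity's appearance.
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):
    """Encodes a rendered pose/skeleton frame into conditioning features."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def forward(self, pose_frame):
        return self.net(pose_frame)

class StructureGuidedDenoiser(nn.Module):
    """Toy denoiser: predicts noise from a noisy latent plus pose features.
    A real SGDM would use a full UNet; this stands in for the idea only."""
    def __init__(self, latent_ch=4, feat_ch=64):
        super().__init__()
        self.encode = StructureEncoder(feat_ch=feat_ch)
        self.denoise = nn.Sequential(
            nn.Conv2d(latent_ch + feat_ch, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, latent_ch, 3, padding=1),
        )

    def forward(self, noisy_latent, pose_frame):
        cond = self.encode(pose_frame)  # structure guidance per frame
        return self.denoise(torch.cat([noisy_latent, cond], dim=1))

model = StructureGuidedDenoiser()
latent = torch.randn(1, 4, 64, 64)  # noisy image latent
pose = torch.randn(1, 3, 64, 64)    # rendered pose frame (same resolution for simplicity)
print(model(latent, pose).shape)    # torch.Size([1, 4, 64, 64])
```

The two-stage strategy would train this denoiser first on many identities for motion coverage, then fine-tune on the one-minute clip of the target identity.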
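The second contribution concerns consistency across frame batches at inference time. The sketch below shows one plausible reading of batch-overlapped denoising, assuming the common approach of splitting frames into overlapping windows and averaging the latents in the overlap at each denoising step; `denoise_step` is a placeholder for a model call, not the paper's API, and the all-frame cross-attention itself is folded into that placeholder.

```python
# Hedged sketch of batch-overlapped temporal denoising: overlapping windows
# of frame latents are denoised one step, and predictions in the overlapping
# regions are averaged so adjacent batches agree.
import torch

def batch_overlapped_step(latents, denoise_step, window=16, overlap=4):
    """One denoising step over all frames using overlapping windows.

    latents: (num_frames, C, H, W) tensor of per-frame latents.
    denoise_step: callable mapping a (batch, C, H, W) window to its
                  denoised window for the current timestep.
    """
    num_frames = latents.shape[0]
    out = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, 1, 1, 1)
    stride = window - overlap
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        out[start:end] += denoise_step(latents[start:end])
        counts[start:end] += 1
        if end == num_frames:
            break
        start += stride
    return out / counts  # average overlapping predictions

# Usage with a dummy step function (identity stands in for the real model):
frames = torch.randn(40, 4, 32, 32)
updated = batch_overlapped_step(frames, denoise_step=lambda x: x)
print(updated.shape)  # torch.Size([40, 4, 32, 32])
```

Because the averaging happens inside the denoising loop rather than as a post-hoc blend, no extra training is needed, which matches the paper's training-free claim for this module.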
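Finally, a rough illustration of the inpainting-based face enhancement: a detected face region is re-generated by an identity-specific model and composited back into the frame. `enhance_face` and `face_box` are assumptions for illustration; the paper's enhancement module and masking details are not specified here, and a production version would blend with a soft mask rather than paste directly.

```python
# Illustrative sketch (not the paper's implementation): crop the face region,
# pass it through an identity-specific enhancer, and paste it back.
import torch

def composite_enhanced_face(frame, face_box, enhance_face):
    """frame: (C, H, W) image tensor; face_box: (top, left, h, w) region."""
    top, left, h, w = face_box
    crop = frame[:, top:top + h, left:left + w]
    enhanced = enhance_face(crop)  # placeholder for the enhancement module
    out = frame.clone()
    out[:, top:top + h, left:left + w] = enhanced  # hard paste for simplicity
    return out

frame = torch.rand(3, 256, 256)
result = composite_enhanced_face(frame, (32, 96, 64, 64), lambda c: c.clamp(0, 1))
print(result.shape)  # torch.Size([3, 256, 256])
```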