Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

25 Mar 2024 | Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee
This paper introduces "Make-Your-Anchor," a diffusion-based 2D avatar generation framework that produces realistic, full-body human videos from a one-minute video clip of an individual. A structure-guided diffusion model maps 3D mesh sequences to the person's appearance, and a two-stage training strategy binds specific movements to that appearance. To generate temporally long videos, the framework extends the 2D U-Net to a 3D style without additional training and applies a batch-overlapped temporal denoising module that bypasses the model's video-length constraint at inference (both ideas are sketched below). A novel identity-specific face enhancement module further improves facial quality.

Trained on a public anchor video dataset, the framework can be combined with motion-capture or audio-to-motion methods to produce anchor-style videos with realistic expressions, gestures, and body movements.
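Extending a pretrained 2D U-Net to a "3D style" without extra training is commonly done by letting each frame's spatial self-attention attend to tokens from every frame in the clip. The sketch below illustrates this inflation trick, assuming a diffusers-style attention module with `to_q`/`to_k`/`to_v`/`to_out` projections; multi-head reshaping is omitted for brevity, and the paper's exact mechanism may differ.

```python
import torch
from einops import rearrange

def inflate_self_attention(hidden_states, attn, num_frames):
    """Run a frozen 2D self-attention layer across frames.

    hidden_states: (batch * num_frames, seq_len, dim) per-frame tokens.
    attn: pretrained 2D attention module (diffusers-style projections).
    Folding the frame axis into the sequence axis lets every spatial
    token attend to tokens from all frames, adding temporal mixing
    with no new weights and no additional training.
    """
    q = attn.to_q(hidden_states)
    # gather keys/values from every frame in the clip
    kv = rearrange(hidden_states, "(b f) n d -> b (f n) d", f=num_frames)
    k = attn.to_k(kv)
    v = attn.to_v(kv)
    # replicate the per-clip keys/values to match the per-frame queries
    k = k.repeat_interleave(num_frames, dim=0)
    v = v.repeat_interleave(num_frames, dim=0)
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    return attn.to_out[0](out)  # to_out[0] is the output linear layer
```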
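The batch-overlapped temporal denoising module is described only at a high level in this summary; below is a minimal sketch of the general overlapped-window idea it builds on. The `denoiser` (fixed window of frames) and `scheduler_step` (reverse-diffusion update) are hypothetical stand-ins. At each diffusion step, overlapping frame windows are denoised independently and their noise predictions are averaged where they overlap, keeping neighboring batches consistent while removing the bound on video length.

```python
import torch

def batch_overlapped_denoise(latents, denoiser, scheduler_step, timesteps,
                             window=16, stride=8):
    """Denoise an arbitrarily long latent video with a fixed-window model.

    latents: (num_frames, C, H, W) noisy latents for the whole video.
    """
    num_frames = latents.shape[0]
    # window start positions; ensure the final frames are always covered
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    for t in timesteps:  # reverse diffusion loop
        noise_pred = torch.zeros_like(latents)
        counts = torch.zeros(num_frames, 1, 1, 1,
                             device=latents.device, dtype=latents.dtype)
        for s in starts:
            e = min(s + window, num_frames)
            # denoise one fixed-size batch of frames
            pred = denoiser(latents[s:e].unsqueeze(0), t).squeeze(0)
            noise_pred[s:e] += pred
            counts[s:e] += 1
        # average the predictions where windows overlap
        noise_pred = noise_pred / counts
        latents = scheduler_step(latents, noise_pred, t)
    return latents
```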
In comparative experiments on ten anchors, the framework outperforms state-of-the-art diffusion and non-diffusion methods in visual quality, temporal coherence, and identity preservation, while handling complex motions and videos of arbitrary length. Training and inference require only a single 40G A100 GPU, making the approach a practical and promising solution for generating digital avatars in real-world scenarios such as e-commerce, online education, and virtual reality.