MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

28 Jun 2024 | Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou
MimicMotion is a framework for generating high-quality, arbitrarily long human motion videos under pose guidance. Its contributions are threefold. First, confidence-aware pose guidance incorporates per-keypoint confidence scores from the pose estimator into the conditioning signal, improving per-frame quality and temporal smoothness while reducing the impact of inaccurate pose estimation. Second, regional loss amplification based on pose confidence places extra training weight on reliably detected regions, in particular the hands, which reduces image distortion and yields the hand region enhancement visible in the generated videos. Third, a progressive latent fusion strategy stitches overlapping video segments in latent space, enabling long videos with smooth transitions.

The model is built on a pre-trained video diffusion backbone, a spatiotemporal U-Net, with a lightweight PoseNet that injects the pose sequence as a condition; starting from the pre-trained backbone allows efficient training with limited data. The confidence-aware pose guidance is integrated into this conditioning path to improve the accuracy of pose-guided generation.

Evaluated on the TikTok dataset and through user studies, MimicMotion shows significant improvements over previous methods in video quality, temporal smoothness, and hand generation.
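The interaction of pose confidence with the training loss can be illustrated with a short sketch. The PyTorch example below is not the authors' implementation: the circular hand mask and the hyperparameters (`hand_gain`, `conf_threshold`, `radius`) are assumptions. It only shows the core idea that the per-pixel loss is amplified around keypoints the pose estimator is confident about, so unreliable detections do not mislead training.

```python
import torch

def confidence_weighted_loss(pred, target, keypoints, confidences,
                             hand_indices, base_weight=1.0, hand_gain=2.0,
                             conf_threshold=0.6, radius=8):
    """Per-pixel loss, amplified around high-confidence hand keypoints.

    pred, target: (B, C, H, W) predicted and target tensors.
    keypoints:    (B, K, 2) keypoint (x, y) pixel coordinates.
    confidences:  (B, K) per-keypoint confidence from the pose estimator.
    hand_indices: indices of hand keypoints among the K keypoints.
    All names and values here are illustrative, not from the paper.
    """
    B, C, H, W = pred.shape
    weight = torch.full((B, 1, H, W), base_weight, device=pred.device)

    ys = torch.arange(H, device=pred.device).view(1, H, 1)
    xs = torch.arange(W, device=pred.device).view(1, 1, W)
    for b in range(B):
        for k in hand_indices:
            if confidences[b, k] < conf_threshold:
                continue  # skip unreliable keypoints: don't amplify noise
            x, y = keypoints[b, k]
            # circular region around the keypoint gets a larger loss weight
            inside = ((xs - x) ** 2 + (ys - y) ** 2) <= radius ** 2
            weight[b, 0][inside[0]] = hand_gain

    per_pixel = (pred - target) ** 2  # standard MSE term
    return (weight * per_pixel).mean()
```

The confidence scores serve double duty: besides weighting the loss as above, they modulate the pose guidance itself, so that uncertain keypoints contribute a weaker conditioning signal.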
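Progressive latent fusion can likewise be sketched: a long video is split into overlapping segments and the overlaps are blended in latent space. The linear cross-fade below is a simplification under assumed segment handling; in the paper the fusion is applied progressively across denoising steps rather than once after sampling, which keeps the overlapping frames consistent throughout generation.

```python
import torch

def fuse_overlapping_latents(segments, overlap):
    """Blend denoised latent segments into one long latent sequence.

    segments: list of (C, F, H, W) latent chunks; consecutive chunks are
              assumed to share `overlap` frames. The linear ramp is an
              illustrative choice, not the paper's exact schedule.
    """
    fused = segments[0]
    # cross-fade weights rising from 0 to 1 across the overlapping frames
    ramp = torch.linspace(0, 1, overlap).view(1, overlap, 1, 1)
    for seg in segments[1:]:
        tail = fused[:, -overlap:]   # end of the sequence built so far
        head = seg[:, :overlap]      # start of the next segment
        blended = (1 - ramp) * tail + ramp * head
        fused = torch.cat([fused[:, :-overlap], blended, seg[:, overlap:]],
                          dim=1)
    return fused
```

Blending in latent space rather than pixel space lets the denoiser reconcile the fused frames on subsequent steps, which is what yields smooth transitions between segments instead of visible seams.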