28 Jun 2024 | Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Yuefeng Zhu, Fangyuan Zou, Junqi Cheng
MimicMotion is a novel framework for generating high-quality, pose-guided human motion videos. It addresses the central challenges of video generation, namely controllability, video length, and detail richness, by introducing confidence-aware pose guidance, regional loss amplification, and a progressive latent fusion strategy. Confidence-aware pose guidance ensures high frame quality and temporal smoothness by adapting the influence of pose guidance according to keypoint confidence scores. Regional loss amplification reduces image distortion, particularly in hand regions, by amplifying the training loss in high-confidence regions. The progressive latent fusion strategy enables the generation of long, smooth videos by fusing overlapping video segments while maintaining temporal coherence. Extensive experiments and user studies demonstrate that MimicMotion outperforms existing methods in video quality, temporal smoothness, and user preference. The framework is built on a pre-trained video diffusion model, leveraging its image-to-video generation capabilities, and handles videos of arbitrary length with acceptable resource consumption.
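The regional loss amplification idea, weighting the training loss by keypoint confidence and boosting it further in hand regions, can be illustrated with a minimal sketch. All names, shapes, and the gain value below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def weighted_mse_loss(pred, target, confidence, hand_mask, hand_gain=2.0):
    """Per-pixel MSE weighted by pose-keypoint confidence, with extra
    weight inside hand regions (hand_gain is a hypothetical value)."""
    # Confidence scales each pixel's contribution; hand pixels get an
    # additional multiplicative boost so distortions there cost more.
    weight = confidence * (1.0 + (hand_gain - 1.0) * hand_mask)
    return float(np.mean(weight * (pred - target) ** 2))
```

In training, `confidence` would come from the pose estimator's per-keypoint scores rendered into a spatial map, so low-confidence (likely occluded) regions contribute less to the loss.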
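Progressive latent fusion generates a long video as overlapping segments and blends the shared frames so segment boundaries stay smooth. A minimal sketch of one possible blend, assuming a simple linear ramp across the overlap (the paper's actual fusion schedule may differ):

```python
import numpy as np

def fuse_segments(seg_a, seg_b, overlap):
    """Fuse two latent segments of shape (frames, dim) whose last/first
    `overlap` frames coincide, ramping linearly from seg_a to seg_b."""
    # alpha goes 0 -> 1 across the overlap, one weight per shared frame.
    alpha = np.linspace(0.0, 1.0, overlap).reshape(-1, 1)
    blended = (1.0 - alpha) * seg_a[-overlap:] + alpha * seg_b[:overlap]
    return np.concatenate([seg_a[:-overlap], blended, seg_b[overlap:]], axis=0)
```

Applied pairwise along a chain of segments, this keeps memory bounded by the segment length while the blended overlaps preserve temporal coherence across boundaries.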