MotionClone: Training-Free Motion Cloning for Controllable Video Generation


28 Jun 2024 | Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin
MotionClone is a training-free framework for controllable text-to-video generation that clones motion from a reference video. The framework uses temporal attention in video inversion to represent the motion of the reference video and introduces primary temporal-attention guidance to mitigate the influence of noisy or subtle motions. In addition, a location-aware semantic guidance mechanism helps the generation model synthesize plausible spatial relationships and improves its prompt-following capability. Extensive experiments demonstrate that MotionClone handles both global camera motion and local object motion, with notable superiority in motion fidelity, textual alignment, and temporal consistency.

The framework comprises two components: primary temporal-attention guidance and location-aware semantic guidance. Primary temporal-attention guidance uses the principal components of the temporal-attention weights for motion-guided video generation, so the model focuses on the primary motion while suppressing noisy or less significant motions.
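To make the idea concrete, below is a minimal, hypothetical sketch of such a guidance term: the temporal-attention map from the reference-video inversion is sparsified to its dominant entries, and the generated video's attention is only penalized on those entries. Function names, tensor shapes, and the choice of top-k selection and MSE loss are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of primary temporal-attention guidance (not the paper's code).
# Assumes temporal-attention maps of shape [batch*spatial, heads, frames, frames]
# extracted from the reference-video inversion and the current denoising pass.
import torch
import torch.nn.functional as F

def primary_mask(ref_attn: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Keep only the k largest temporal-attention entries per query position."""
    topk = ref_attn.topk(k, dim=-1).indices       # indices of the dominant motion entries
    mask = torch.zeros_like(ref_attn)
    mask.scatter_(-1, topk, 1.0)                  # 1 on primary entries, 0 elsewhere
    return mask

def motion_guidance_loss(ref_attn: torch.Tensor,
                         gen_attn: torch.Tensor,
                         k: int = 1) -> torch.Tensor:
    """Match the generated attention to the reference only on its primary entries."""
    mask = primary_mask(ref_attn, k)
    return F.mse_loss(gen_attn * mask, ref_attn * mask)

# During sampling, the gradient of this loss w.r.t. the noisy latent could steer
# the denoising step, e.g.:
#   loss = motion_guidance_loss(ref_attn, gen_attn)
#   grad = torch.autograd.grad(loss, latent)[0]
#   latent = latent - guidance_scale * grad
```

Restricting the loss to the dominant entries is what lets the guidance follow the main motion pattern without copying the reference video's noisy or subtle attention structure.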
Location-aware semantic guidance uses the coarse foreground location from the reference video together with the features of the original classifier-free guidance branch to steer generation, preserving generative flexibility while improving the plausibility of spatial relationships in the synthesized video.

MotionClone is compared with existing methods including VideoComposer, Tune-A-Video, Control-A-Video, VMC, Gen-1, and MotionCtrl. The results show that MotionClone achieves better motion quality with excellent detail preservation and superior textual alignment, cloning motion from reference videos with high fidelity in both global camera motion and local object motion scenarios. There are inherent limitations, such as the need for suitable motion in the reference video and the potential retention of structural elements from the reference video. Despite these limitations, MotionClone represents a significant advance in AI-driven video generation and carries distinct societal implications, both beneficial and challenging.
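As a loose illustration of the location-aware idea, the sketch below applies a semantic-alignment penalty only inside the coarse foreground region, using the classifier-free-guidance features as the anchor, so the background remains free to follow the text prompt. All names, shapes, and the MSE formulation are assumptions for illustration and not the authors' implementation.

```python
# Hypothetical sketch of location-aware semantic guidance (not the paper's code).
# `fg_mask` is a coarse foreground mask derived from the reference video;
# `cfg_feat` are features from the ordinary classifier-free-guided branch;
# `gen_feat` are the corresponding features of the current sample.
import torch
import torch.nn.functional as F

def location_aware_semantic_loss(gen_feat: torch.Tensor,
                                 cfg_feat: torch.Tensor,
                                 fg_mask: torch.Tensor) -> torch.Tensor:
    """Pull generated features toward the classifier-free-guidance features,
    but only inside the coarse foreground region, leaving the background
    its generative flexibility."""
    fg = fg_mask.to(gen_feat.dtype)
    return F.mse_loss(gen_feat * fg, cfg_feat.detach() * fg)
```

In a guided sampling loop, this term would be weighted and added to the motion guidance loss before taking the gradient with respect to the noisy latent.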