MotionClone: Training-Free Motion Cloning for Controllable Video Generation


28 Jun 2024 | Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin
**Affiliations:** University of Science and Technology of China, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Shanghai AI Laboratory
**URL:** https://github.com/Bujiazi/MotionClone/

**Abstract:** Motion-based controllable text-to-video generation encodes motion cues from a reference video to steer video generation. Previous methods typically train dedicated modules to encode motion cues or fine-tune video diffusion models, which often yields suboptimal results outside the trained domain. This work introduces MotionClone, a training-free framework that clones motion from a reference video to control text-to-video generation. It represents motion with the temporal attention obtained during video inversion and introduces primary temporal-attention guidance to mitigate the influence of noisy or subtle motion. In addition, a location-aware semantic guidance mechanism helps the generation model synthesize reasonable spatial relationships and improves prompt-following capability. Extensive experiments show that MotionClone handles both global camera motion and local object motion, with notable improvements in motion fidelity, textual alignment, and temporal consistency.

**Introduction:** Generating high-quality videos that align with human intent has attracted significant attention. Text-to-video (T2V) models have made substantial progress, but challenges remain because motion synthesis is complex. MotionClone diverges from prior approaches by using temporal attention to capture the motion of the reference video, which renders detailed motion while keeping interdependence with the reference video's structural components minimal. The framework combines primary temporal-attention guidance and location-aware semantic guidance to improve motion cloning and spatial-relationship synthesis.

**Related Work:**
- **Text-to-video diffusion models:** Advances in T2I generation have inspired T2V models, but these still struggle with motion quality and data scarcity.
- **Controllable video generation:** Prior studies explore diverse control signals for versatile video generation, including motion trajectories, regions, and objects.
- **Attention control:** Attention mechanisms are crucial for high-quality content generation, with recent work focusing on spatial and temporal attention blocks.

**Methodology:**
- **Primary temporal-attention guidance:** Aligns the primary temporal-attention components of the generated latent with those of the reference latent to clone motion.
- **Location-aware semantic guidance:** Derives coarse object masks from the cross-attention layers and uses them to guide the generation process, enhancing spatial relationships and prompt-following capability. A hedged sketch of both guidance terms follows this list.
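The summary above does not spell out the exact formulation of the two guidance terms, so the following PyTorch sketch illustrates one plausible way to express them as differentiable energies. The tensor layouts, the top-k selection of "primary" components, the thresholding of cross-attention into a coarse mask, and the function names are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of two guidance energies, assuming the tensor layouts below.
import torch


def primary_temporal_attention_guidance(attn_gen, attn_ref, k=1):
    """Penalize mismatch only on the dominant (primary) temporal-attention entries.

    attn_gen, attn_ref: temporal-attention maps of shape
        [batch*heads, spatial_tokens, frames, frames] (assumed layout), taken from
        the same temporal-attention block for the generated latent and for the
        inverted reference latent.
    k: number of primary components kept per query row (assumption).
    """
    # Keep only the top-k reference entries per row so that weak or noisy
    # attention values do not dominate the guidance signal.
    topk_vals, _ = attn_ref.topk(k, dim=-1)
    mask = (attn_ref >= topk_vals[..., -1:]).float()
    # L2 energy restricted to the primary components.
    return (mask * (attn_gen - attn_ref)).pow(2).sum() / mask.sum().clamp(min=1)


def location_aware_semantic_guidance(cross_attn_ref, feat_gen, feat_ref, thresh=0.3):
    """Use a coarse object mask from reference cross-attention to align semantics.

    cross_attn_ref: cross-attention map for the object token(s), shape
        [batch, frames, H, W] (assumed), from the reference inversion.
    feat_gen, feat_ref: intermediate U-Net features for the generation and the
        reference, shape [batch, frames, C, H, W] (assumed).
    """
    # Normalize the attention map to [0, 1] and threshold it into a coarse
    # foreground mask marking where the object should appear.
    a = cross_attn_ref - cross_attn_ref.amin(dim=(-2, -1), keepdim=True)
    a = a / a.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-6)
    mask = (a > thresh).float().unsqueeze(2)          # [B, F, 1, H, W]
    # Encourage generated features inside the object region to stay close to
    # the reference, which helps preserve a plausible spatial layout.
    diff = (feat_gen - feat_ref).pow(2) * mask
    return diff.sum() / mask.sum().clamp(min=1)
```

In practice, the gradient of such energies with respect to the noisy latent would be used to steer sampling at each denoising step, as sketched after the Experiments section below.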
**Experiments:**
- **Implementation details:** AnimateDiff is used as the base model, with specific settings for reference-video processing and for how the guidance is applied; a sketch of a guided sampling step appears after this list.
- **Experimental setup:** Evaluation is conducted on the DAVIS dataset, with metrics covering textual alignment, temporal consistency, and a user study.
- **Baselines:** VideoComposer, Tune-A-Video, Control-A-Video, VMC, Gen-1, and MotionCtrl.
- **Qualitative and quantitative comparisons:** MotionClone handles both global camera motion and local object motion, with notable gains over the baselines in motion fidelity, textual alignment, and temporal consistency.
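To make the role of the guidance concrete, here is a minimal, hedged sketch of how an energy like the one above could steer a denoising step in classifier-guidance style. `denoise_with_attn`, the guidance weight, and the wrapper itself are illustrative placeholders rather than the MotionClone or AnimateDiff API; the sketch reuses `primary_temporal_attention_guidance` from the Methodology sketch.

```python
# Hedged sketch of a guidance-corrected denoising step; all names and the
# weight value are illustrative assumptions, not values from the paper.
import torch


def guided_denoise_step(denoise_with_attn, z_t, t, text_emb, ref_attn, weight=2000.0):
    """One denoising step whose noise prediction is corrected by the gradient
    of the temporal-attention guidance energy.

    denoise_with_attn: hypothetical callable that runs the video diffusion
    U-Net on z_t and returns (noise_pred, temporal_attention_maps).
    """
    z_t = z_t.detach().requires_grad_(True)
    noise_pred, gen_attn = denoise_with_attn(z_t, t, text_emb)
    # Energy from the Methodology sketch above.
    energy = primary_temporal_attention_guidance(gen_attn, ref_attn)
    grad = torch.autograd.grad(energy, z_t)[0]
    # Shift the noise prediction along the guidance gradient; `weight` is an
    # illustrative hyper-parameter.
    return noise_pred + weight * grad
```

Adding the energy gradient to the noise prediction nudges the denoised estimate toward lower guidance energy at inference time, which is what allows motion to be cloned without any training or fine-tuning.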