21 Aug 2024 | Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen
**MotionBooth: Motion-Aware Customized Text-to-Video Generation**
**Project Page:** <https://jianzongwu.github.io/projects/motionbooth>
**Email:** jzwu@stu.pku.edu.cn, xiangtai94@gmail.com
**Abstract:**
MotionBooth is a framework for generating videos with customized subjects and precise control over both object and camera movements. By fine-tuning a text-to-video (T2V) model on a few images of a specific object, MotionBooth accurately captures the object's shape and attributes. The framework introduces a subject region loss, a video preservation loss, and a subject token cross-attention loss to enhance subject learning and video quality. At inference, training-free techniques control camera and subject motions: a latent shift module for camera movement and cross-attention map manipulation for subject motion. Extensive evaluations demonstrate MotionBooth's effectiveness in generating diverse videos with controllable subject and camera motions.
**Introduction:**
Generating videos with customized subjects, such as specific scenarios involving a particular dog or toy, has gained significant attention. Existing methods often struggle with subject learning and video motion preservation, leading to issues like background degradation and static videos. MotionBooth addresses these challenges by preserving video generation capability during subject learning and integrating motion control during inference.
**Method:**
- **Task Formulation:** MotionBooth generates videos with customized subjects and controlled camera and subject motions.
- **Overall Pipeline:** The pipeline includes subject learning using a T2V model fine-tuned on a few images, incorporating subject region loss, video preservation loss, and subject token cross-attention loss. During inference, training-free techniques control camera and subject motions.
- **Subject Learning:** MotionBooth introduces subject region loss to prevent background overfitting and video preservation loss to maintain video generation capabilities. The subject token cross-attention loss connects subject tokens with their positions in cross-attention maps.
- **Subject Motion Control:** Bounding boxes specify the subject's position in each frame during inference; the model is steered toward these positions by editing its cross-attention maps.
- **Camera Movement Control:** A latent shift module shifts the noised latents frame by frame to simulate camera movement while keeping transitions smooth.
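To make the latent shift idea concrete, here is a minimal sketch of training-free camera panning by shifting each frame's noised latent. The function name, tensor layout `(F, C, H, W)`, per-frame speed in latent-pixel units, and the wrap-around (`torch.roll`) boundary handling are all illustrative assumptions, not MotionBooth's exact implementation.

```python
import torch


def latent_shift(latents: torch.Tensor, speed: tuple[int, int]) -> torch.Tensor:
    """Simulate a constant camera pan by cumulatively shifting noised latents.

    Assumed (hypothetical) interface:
      latents: (F, C, H, W) noised latents, one per video frame.
      speed:   (dx, dy) horizontal/vertical shift per frame, in latent pixels.
    """
    dx, dy = speed
    shifted = []
    for t, frame in enumerate(latents):
        # The shift grows linearly with the frame index, so the camera
        # appears to move at constant velocity across the clip.
        shifted.append(torch.roll(frame, shifts=(t * dy, t * dx), dims=(-2, -1)))
    return torch.stack(shifted)
```

Applied once per denoising step (or for an early subset of steps), this biases the generated video toward the desired camera motion without any extra training.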
**Experiments:**
- **Datasets and Implementation Details:** Describes the datasets used for evaluation and the training hyperparameters.
- **Baselines and Evaluation Metrics:** Comparisons with related works on motion-aware customized video generation and camera movement control are conducted.
- **Quantitative and Qualitative Results:** MotionBooth outperforms baselines in subject fidelity, temporal consistency, and camera motion fidelity.
**Conclusion:**
MotionBooth is a novel framework for motion-aware, customized video generation, effectively controlling both object and camera movements while preserving video quality.