MotionBooth: Motion-Aware Customized Text-to-Video Generation


21 Aug 2024 | Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen
MotionBooth is a framework for motion-aware customized text-to-video generation that enables precise control over both subject and camera motion. Given a few images of a specific object, MotionBooth efficiently fine-tunes a text-to-video (T2V) model to capture the object's shape and attributes. Training introduces a subject region loss and a video preservation loss to improve subject fidelity without degrading the base model's video-generation capability, together with a subject token cross-attention loss that ties the customized subject to the motion control signals. At inference, training-free techniques control subject and camera motion, so motion can be injected into subject-driven generation without additional tuning, and the approach can be applied to different base T2V models. Extensive quantitative and qualitative evaluations show that MotionBooth preserves subject appearance while simultaneously controlling motion in the generated videos. Models and code will be made publicly available.

The contributions include a unified framework for motion-aware customized video generation, a loss-augmented training scheme for subject learning, and training-free methods for controlling subject and camera motion during inference.
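To make the training objective more concrete, the sketch below shows one hypothetical way the three losses described above could be combined in PyTorch. The tensor shapes, the specific loss forms (e.g., the binary cross-entropy used for the cross-attention term), and the weights are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch of a MotionBooth-style training objective.
# Shapes, loss forms, and weights are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def subject_region_loss(pred_noise, target_noise, subject_mask):
    """Denoising (MSE) loss concentrated on the subject region.

    pred_noise, target_noise: (B, C, F, H, W) predicted / ground-truth noise
    subject_mask: (B, 1, F, H, W) binary mask covering the subject
    """
    per_pixel = (pred_noise - target_noise) ** 2
    masked = per_pixel * subject_mask
    return masked.sum() / subject_mask.sum().clamp(min=1.0)

def video_preservation_loss(pred_noise_reg, target_noise_reg):
    """Standard diffusion loss on regularization videos, intended to
    preserve the base model's general video-generation capability."""
    return F.mse_loss(pred_noise_reg, target_noise_reg)

def subject_token_ca_loss(attn_map, subject_mask_2d):
    """Encourage the subject token's cross-attention to focus on the
    subject region (the BCE form here is an assumption).

    attn_map: (B, F, H, W) cross-attention of the subject token
    subject_mask_2d: (B, F, H, W) subject mask at the attention resolution
    """
    return F.binary_cross_entropy(
        attn_map.clamp(1e-6, 1 - 1e-6), subject_mask_2d.float()
    )

def total_loss(pred, target, mask, pred_reg, target_reg, attn, mask_2d,
               w_region=1.0, w_preserve=1.0, w_attn=0.1):
    """Combined objective; the weights are placeholders, not tuned values."""
    return (w_region * subject_region_loss(pred, target, mask)
            + w_preserve * video_preservation_loss(pred_reg, target_reg)
            + w_attn * subject_token_ca_loss(attn, mask_2d))
```

In this reading, the region and preservation terms pull in opposite directions (fit the new subject vs. keep the pretrained behavior), while the cross-attention term is what later lets the training-free inference techniques steer where the subject appears.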