Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

6 May 2024 | Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao
Direct-a-Video is a text-to-video generation framework that lets users independently control camera movement and object motion. Users can quantitatively specify camera pan and zoom as well as the motion of individual scene objects, enabling customized video content. The framework decouples the two forms of control through orthogonal mechanisms: camera movement is learned with a self-supervised training scheme, so pan and zoom can be specified numerically, while object motion is handled by a training-free spatial cross-attention modulation in which users define motion trajectories by drawing bounding boxes. This design avoids the need for extensive motion annotations and offers greater flexibility in video generation.

Evaluated on FVD, FID-vid, and flow error, the method outperforms existing approaches. It supports joint control of camera and object motion, handles both single- and multi-object scenarios, and produces high-quality videos with precise motion control. The system builds on a pretrained T2V model with additional modules for camera and object control, and its efficient, flexible design makes it well suited to creative video synthesis with customized motion.
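As an illustration of the self-supervised camera-movement training described above, the sketch below constructs (clip, camera-parameter) training pairs by applying a known pan/zoom crop to an existing clip; the exact augmentation, parameter ranges, and tensor layout are assumptions for illustration rather than the paper's implementation.

```python
# Hypothetical sketch: build self-supervised (clip, camera-parameter) pairs by
# cropping a panning/zooming window from a source clip, so the model can later be
# conditioned on quantitative pan/zoom values without manual motion annotation.
import torch
import torch.nn.functional as F


def pan_zoom_augment(frames: torch.Tensor, pan_x: float, pan_y: float, zoom: float):
    """frames: (T, C, H, W) clip in [0, 1].
    pan_x, pan_y in [-1, 1] give the total fractional shift over the clip;
    zoom >= 1 zooms in progressively. Returns the augmented clip and the
    camera parameters used as the conditioning signal."""
    T, C, H, W = frames.shape
    out = []
    for t in range(T):
        alpha = t / max(T - 1, 1)                       # progress through the clip
        scale = 1.0 / (1.0 + alpha * (zoom - 1.0))      # window shrinks as we zoom in
        crop_h, crop_w = max(int(H * scale), 1), max(int(W * scale), 1)
        # the crop window drifts linearly according to the pan parameters
        cy = int((H - crop_h) * (0.5 + 0.5 * alpha * pan_y))
        cx = int((W - crop_w) * (0.5 + 0.5 * alpha * pan_x))
        crop = frames[t:t + 1, :, cy:cy + crop_h, cx:cx + crop_w]
        out.append(F.interpolate(crop, size=(H, W), mode="bilinear", align_corners=False))
    cam_params = torch.tensor([pan_x, pan_y, zoom])     # fed to the camera-conditioning module
    return torch.cat(out, dim=0), cam_params
```

For the training-free object-motion control, the following sketch shows one way a spatial cross-attention modulation could bias the pre-softmax attention between image queries and an object's text token toward the user-drawn box of the current frame; the function name, arguments, and bias scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: amplify cross-attention to the object's text token inside the
# user-drawn box for a frame and suppress it outside, before the softmax. Applied per
# frame, with the box interpolated along the user's trajectory, this steers where the
# object appears without any additional training.
import torch


def modulate_cross_attention(attn_logits: torch.Tensor,
                             box: tuple,          # (x0, y0, x1, y1), normalized to [0, 1]
                             token_idx: int,      # index of the object's word in the prompt
                             h: int, w: int,      # spatial resolution of the attention map
                             strength: float = 5.0) -> torch.Tensor:
    """attn_logits: (B, h*w, n_text_tokens) pre-softmax cross-attention scores
    for one frame. Returns the modulated logits."""
    ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    inside = ((xs >= box[0]) & (xs <= box[2]) &
              (ys >= box[1]) & (ys <= box[3])).flatten()      # (h*w,) mask of box pixels
    bias = strength * (2.0 * inside.float() - 1.0)            # +strength inside, -strength outside
    attn_logits = attn_logits.clone()
    attn_logits[:, :, token_idx] += bias                      # bias only the object's token
    return attn_logits
```

Because this modulation only edits attention maps at inference time, it composes naturally with the learned camera conditioning, which is how the framework achieves joint yet decoupled control of the two motion types.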