6 May 2024 | Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao
**Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion**
**Abstract:**
Recent text-to-video (T2V) diffusion models have made significant progress, but existing methods cannot control object motion and camera movement independently, which limits their flexibility. This paper introduces Direct-a-Video, a system that lets users independently specify camera pan and zoom as well as the motion of one or more objects, as if directing a video. The key idea is to decouple the two controls: camera movement is conditioned through newly introduced temporal cross-attention layers trained with self-supervised augmentation, while object motion is steered by training-free spatial cross-attention modulation that requires no additional optimization. Extensive experiments demonstrate the effectiveness and superiority of the method.
**Contributions:**
- A unified framework for controllable video generation that decouples camera movement and object motion.
- A novel temporal cross-attention module for conditioning generation on camera movement.
- A training-free spatial cross-attention modulation scheme for object motion control.
**Keywords:**
Text-to-video generation, motion control, diffusion model.
**Related Work:**
The paper reviews existing T2V models and methods for video generation with controllable motion, highlighting the limitations of current approaches in handling user-defined and disentangled control over camera and object motion.
**Method:**
- **Task Formulation:** Given a text prompt, users additionally specify camera movement (pan and zoom parameters) and, for each object, a phrase from the prompt together with a bounding-box trajectory (see the input sketch after this list).
- **Overall Pipeline:** Camera movement control is learned during training, while object motion control is applied purely at inference time, keeping the two decoupled.
- **Camera Movement Control:** Newly added temporal cross-attention layers condition the frozen T2V backbone on the camera parameters; they are trained self-supervised on clips augmented with synthetic pans and zooms (second sketch below).
- **Object Motion Control:** Spatial cross-attention modulation steers object placement frame by frame without additional optimization (third sketch below).
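For concreteness, here is a minimal sketch of the user-facing inputs described above; all class and field names are hypothetical illustrations, not the authors' API:

```python
from dataclasses import dataclass, field

@dataclass
class CameraMovement:
    cx: float = 0.0  # horizontal pan ratio (negative = left, positive = right)
    cy: float = 0.0  # vertical pan ratio
    cz: float = 1.0  # zoom ratio (>1 zooms in, <1 zooms out)

@dataclass
class ObjectMotion:
    phrase: str  # the object's words in the prompt, e.g. "a tiger"
    # one (x0, y0, x1, y1) box per frame, in normalized [0, 1] coordinates
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class DirectAVideoRequest:
    prompt: str
    camera: CameraMovement
    objects: list[ObjectMotion]
```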
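The camera module below is a minimal PyTorch sketch of how pan/zoom parameters could be injected through a new temporal cross-attention layer added on top of a frozen backbone; the embedder architecture, token count, and zero-initialization are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Maps the pan/zoom triple (cx, cy, cz) to a small set of camera tokens."""
    def __init__(self, dim: int = 320, n_tokens: int = 4):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.SiLU(),
                                 nn.Linear(dim, dim * n_tokens))

    def forward(self, cam: torch.Tensor) -> torch.Tensor:
        # cam: (batch, 3) -> camera tokens: (batch, n_tokens, dim)
        return self.mlp(cam).view(cam.shape[0], self.n_tokens, self.dim)

class CameraTemporalCrossAttention(nn.Module):
    """Per-location frame features attend to camera tokens. The output
    projection is zero-initialized so the frozen backbone's behavior is
    unchanged at the start of training."""
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, cam_tokens: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim), flattened over spatial locations upstream
        out, _ = self.attn(self.norm(x), cam_tokens, cam_tokens)
        return x + self.proj(out)  # residual: camera signal added on top
```

In this setup only the new layers and the embedder would receive gradients; the pan/zoom labels come for free from the cropping-and-shifting augmentation applied to training clips.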
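Object motion control is training-free: at each denoising step, the spatial cross-attention scores of the object's word tokens are amplified inside the user's box and suppressed outside it. A minimal sketch, assuming pre-softmax scores and a per-frame boolean box mask (names and the modulation strength are illustrative):

```python
import torch

def modulate_object_attention(scores: torch.Tensor,
                              box_mask: torch.Tensor,
                              obj_token_ids: list[int],
                              strength: float = 5.0) -> torch.Tensor:
    """Amplify/suppress attention for one object's tokens (hypothetical API).

    scores:        (batch, heads, pixels, tokens) pre-softmax attention scores
    box_mask:      (pixels,) bool, True inside the object's box for this frame
    obj_token_ids: prompt-token indices of the object's phrase
    """
    scores = scores.clone()
    for t in obj_token_ids:
        scores[:, :, box_mask, t] += strength   # pull the object into the box
        scores[:, :, ~box_mask, t] -= strength  # keep it out of everywhere else
    return scores
```

Repeating this per frame, with the box following the user-drawn trajectory, is what produces the object motion; the attention amplification and suppression evaluated in the ablation study correspond to the two terms above.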
**Experiments:**
- **Qualitative and Quantitative Comparisons:** Direct-a-Video outperforms baselines in camera movement control and object motion control, demonstrating superior visual quality and controllability.
- **Ablation Study:** Evaluates the effectiveness of key components such as attention amplification and suppression.
**Limitations:**
- Conflicts can arise in joint control scenarios.
- Limited ability to produce complex 3D camera movements.
- Issues with overlapping boxes in object control.
**Conclusion:**
Direct-a-Video provides a flexible and efficient tool for creative video synthesis with customized motion, addressing the need for independent and user-directed control over camera movement and object motion.