2 Feb 2024 | Jiawei Wang*, Yuchen Zhang*, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li
**Boximator: Generating Rich and Controllable Motions for Video Synthesis**
**Authors:** Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li
**Affiliation:** ByteDance Research
**Abstract:**
Generating rich and controllable motion is a pivotal challenge in video synthesis. Boximator introduces a new approach to fine-grained motion control based on two types of constraints: hard boxes and soft boxes. Users select objects in the conditional frame with hard boxes and then define their position, shape, or motion path in future frames using either type of box. Boximator functions as a plug-in for existing video diffusion models, training only the control module while keeping the base model's weights frozen. To address training challenges, a novel self-tracking technique is introduced that simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models and improving further once box constraints are incorporated. Human evaluation also shows that users favor Boximator's results over those of the base models.
**Introduction:**
Video synthesis has seen significant advancements, with recent research focusing on enhancing controllability through frame-level constraints. Boximator introduces box-shaped constraints as a universal mechanism for fine-grained motion control: users control multiple objects across frames by associating a unique object ID with each box. The method is flexible, supports both foreground and background objects, and can even modify the pose of larger objects. It is particularly useful when generation is conditioned on an image, since users can select objects simply by drawing hard boxes around them. For frames without user-defined boxes, Boximator enables approximate motion-path control via algorithm-generated soft boxes.
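To make the constraint format concrete, here is a minimal sketch of how a set of box constraints could be represented. The class and field names (`BoxConstraint`, `object_id`, `is_hard`, and so on) are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BoxConstraint:
    """One box constraint for one object in one frame (illustrative only).

    Coordinates are assumed to be normalized to [0, 1] relative to the
    frame width and height.
    """
    object_id: int                          # the same ID links an object's boxes across frames
    frame_index: int                        # which frame this constraint applies to
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    is_hard: bool = True                    # hard box: exact region; soft box: loose region

# Example: object 0 is selected with a hard box in the conditional frame (frame 0),
# then loosely required to end up on the right side of frame 15 via a soft box.
constraints: List[BoxConstraint] = [
    BoxConstraint(object_id=0, frame_index=0,  box=(0.10, 0.20, 0.35, 0.60)),
    BoxConstraint(object_id=0, frame_index=15, box=(0.55, 0.15, 0.90, 0.70), is_hard=False),
]
```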
**Related Work:**
Recent video diffusion models have also explored motion controllability, with methods such as TrailBlazer and FACTOR. However, these approaches do not combine precise object selection with soft box constraints, which limits their effectiveness in complex, multi-object scenarios.
**Background:**
Boximator is built on top of video diffusion models that use a 3D U-Net architecture. Starting from Gaussian noise, the model iteratively predicts noise vectors and removes them step by step to produce high-quality video frames.
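As a refresher on how such models generate video, the sketch below shows a generic diffusion sampling loop. The `unet3d` and `scheduler` objects and their interfaces (a diffusers-style `scheduler.step` returning a `prev_sample`) are assumptions for illustration, not the paper's code.

```python
import torch

@torch.no_grad()
def sample_video(unet3d, scheduler, shape, text_emb, box_emb=None):
    """Generic diffusion sampling loop for a 3D U-Net video model (illustrative).

    shape is the video tensor shape, e.g. (batch, channels, frames, height, width).
    Each step predicts the noise component and moves the sample closer to a clean video.
    """
    x = torch.randn(shape)                         # start from pure Gaussian noise
    for t in scheduler.timesteps:                  # e.g. T-1, T-2, ..., 0
        eps = unet3d(x, t, text_emb, box_emb)      # 3D U-Net predicts the noise at step t
        x = scheduler.step(eps, t, x).prev_sample  # remove a portion of the predicted noise
    return x                                       # denoised video (or latent to be decoded)
```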
**Boximator: Box-guided Motion Control:**
- **Model Architecture:** Boximator adds a new self-attention layer to the spatial attention blocks of the video diffusion model to incorporate box constraints (a minimal sketch of this control layer appears after this list).
- **Data Pipeline:** An automatic annotation pipeline selects 1.1M highly dynamic video clips from the WebVid-10M dataset and annotates 2.2M objects with bounding boxes.
- **Self-Tracking:** A novel training technique has the model render a colored bounding box around each controlled object while generating the video, simplifying the learning of box-object correlations.
- **Multi-Stage Training Procedure:** The model is trained in three stages, gradually increasing the complexity of box constraints.
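As referenced in the architecture bullet above, the following is a minimal sketch of how box constraints could be fused into a spatial attention block through an added self-attention layer. The encoding of a box as coordinates plus a hard/soft flag plus an object-ID embedding, and all names and dimensions, are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class BoxControlAttention(nn.Module):
    """Sketch of an added control layer: visual tokens and encoded box tokens are
    concatenated and passed through self-attention; only the visual tokens are kept.
    """
    def __init__(self, dim: int, num_heads: int = 8, max_objects: int = 16):
        super().__init__()
        self.id_embed = nn.Embedding(max_objects, dim)   # object-ID embedding
        self.box_proj = nn.Linear(4 + 1, dim)            # (x_min, y_min, x_max, y_max, hard/soft flag)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens, boxes, obj_ids):
        # visual_tokens: (B, N, dim) spatial tokens of one frame
        # boxes:         (B, M, 5)   normalized coordinates plus hard/soft flag
        # obj_ids:       (B, M)      integer object IDs
        box_tokens = self.box_proj(boxes) + self.id_embed(obj_ids)  # (B, M, dim)
        tokens = torch.cat([visual_tokens, box_tokens], dim=1)      # (B, N+M, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        n = visual_tokens.shape[1]
        # residual update of the visual tokens only; box tokens are discarded
        return visual_tokens + self.out_proj(fused[:, :n])
```

In this sketch the box tokens only inject information into the visual tokens and are then dropped, so the base model's original layers are left untouched, consistent with the plug-in design in which only the control module is trained.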
**Experiments:**
- **Experiment Settings:** Boximator is trained on two base models: PixelDance and ModelScope.