2 Feb 2024 | Jiawei Wang*, Yuchen Zhang*, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li
Boximator is a novel approach for generating rich and controllable motions in video synthesis. It introduces two types of constraints: hard boxes, which define an object's bounding box precisely, and soft boxes, which define a broader region within which the object must reside. These box constraints let users control object motion directly, without relying on text prompts.

Boximator functions as a plug-in for existing video diffusion models: it preserves the base model's knowledge by freezing its weights and training only the added control module. Because the original weights are never modified, the approach can be applied to any video diffusion model, and its performance improves as base models evolve. A self-tracking technique significantly simplifies learning box-object correlations by training the model to generate each object's colored bounding box as part of the video.

Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, outperforming its two base models and improving further when box constraints are added, and human evaluators prefer its results over the base model's. In extensive experiments on datasets such as MSR-VTT and ActivityNet, it achieves higher average precision (AP) scores for motion control, and case studies demonstrate that it handles complex scenarios with precise motion control.
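To make the hard/soft distinction concrete, here is a minimal sketch (not the authors' code; all names are hypothetical) of how the two constraint types might be represented and checked, assuming normalized box coordinates:

```python
from dataclasses import dataclass

# Normalized [0, 1] coordinates: (left, top, right, bottom).
Box = tuple[float, float, float, float]

@dataclass
class BoxConstraint:
    """One motion constraint: an object's box in one frame.

    A hard box pins the object's bounding box exactly; a soft box only
    requires the object's box to lie somewhere inside the given region.
    """
    frame_index: int
    object_id: int   # ties boxes across frames to the same object
    region: Box
    hard: bool       # True = hard box, False = soft box

def satisfies(obj_box: Box, c: BoxConstraint, tol: float = 0.05) -> bool:
    """Check a generated object's box against one constraint."""
    l, t, r, b = obj_box
    cl, ct, cr, cb = c.region
    if c.hard:
        # Hard box: the generated box must match the constraint closely.
        return max(abs(l - cl), abs(t - ct), abs(r - cr), abs(b - cb)) <= tol
    # Soft box: the generated box only has to be contained in the region.
    return l >= cl and t >= ct and r <= cr and b <= cb
```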
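The plug-in training setup described above can be illustrated with a short PyTorch-style sketch. The modules below are stand-ins (a real base model would be a pretrained video diffusion network); the point is only the freezing pattern: the base weights receive no gradients, and only the control module is optimized:

```python
import torch
import torch.nn as nn

base_model = nn.Sequential(          # stand-in for a pretrained video diffusion model
    nn.Conv3d(4, 64, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv3d(64, 4, kernel_size=3, padding=1),
)
control_module = nn.Sequential(      # stand-in for the added box-control layers
    nn.Linear(5, 64),                # e.g. (left, top, right, bottom, hard/soft flag)
    nn.SiLU(),
    nn.Linear(64, 64),
)

# Freeze the base model so its pretrained knowledge is preserved.
for p in base_model.parameters():
    p.requires_grad = False

# Only the control module's parameters receive gradient updates.
optimizer = torch.optim.AdamW(control_module.parameters(), lr=1e-4)
```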
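Finally, the self-tracking idea — having the model render each constrained object's colored bounding box directly in the video — implies training targets in which box outlines are painted onto frames. The helper below is a hypothetical illustration of that overlay step, not the authors' implementation:

```python
import numpy as np

def draw_box_outline(frame: np.ndarray, box: tuple[int, int, int, int],
                     color: tuple[int, int, int], thickness: int = 2) -> np.ndarray:
    """Paint a colored rectangle outline onto an (H, W, 3) uint8 frame."""
    out = frame.copy()
    x1, y1, x2, y2 = box
    out[y1:y1 + thickness, x1:x2] = color   # top edge
    out[y2 - thickness:y2, x1:x2] = color   # bottom edge
    out[y1:y2, x1:x1 + thickness] = color   # left edge
    out[y1:y2, x2 - thickness:x2] = color   # right edge
    return out
```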
Boximator's results highlight its effectiveness in generating high-quality videos with controllable motion, making it a valuable tool for video synthesis. Ethical and social risks associated with video generation technologies, including the potential for deepfakes and biases, are also discussed.