20 Jul 2024 | Sherwin Bahmani¹,²,³ Ivan Skorokhodov³ Aliaksandr Siarohin³ Willi Menapace³ Guocheng Qian³ Michael Vasilkovsky³ Hsin-Ying Lee³ Chaoyang Wang³ Jiaxu Zou³ Andrea Tagliasacchi¹,⁴ David B. Lindell¹,² Sergey Tulyakov³
VD3D is a method for controlling 3D camera motion in text-to-video generation with video diffusion transformers. It introduces a ControlNet-like conditioning mechanism that injects spatiotemporal camera embeddings derived from Plücker coordinates, enabling fine-grained control over camera poses during generation and the synthesis of complex scenes from varied viewpoints. The approach targets large video transformers that process spatiotemporal information jointly and addresses the lack of effective camera control in existing methods: the SnapVideo model is adapted with the new conditioning mechanism, which supports efficient fine-tuning while preserving visual quality. Evaluated on the RealEstate10K dataset, VD3D achieves state-of-the-art camera controllability and video quality, outperforming prior approaches and enabling downstream applications such as multi-view text-to-video generation. The results highlight the importance of spatiotemporal camera conditioning and demonstrate the potential of large video transformers for 3D camera control.
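As a rough illustration of the camera conditioning signal, the sketch below computes per-pixel Plücker ray embeddings (ray direction and moment, six channels per pixel) from camera intrinsics and a camera-to-world pose. The function name, tensor layout, and pixel-center convention are illustrative assumptions, not the paper's implementation.

```python
import torch

def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker ray embeddings for one camera (hypothetical helper).

    K:   (3, 3) camera intrinsics
    c2w: (4, 4) camera-to-world pose
    Returns a (6, height, width) tensor: (ray direction, origin x direction).
    """
    # Pixel grid sampled at pixel centers
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H, W, 3)

    # Back-project pixels to camera-space directions, rotate into world space
    dirs_cam = pix @ torch.linalg.inv(K).T                     # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Moment vector of each ray: camera origin crossed with the ray direction
    origin = c2w[:3, 3].expand_as(dirs_world)                  # (H, W, 3)
    moment = torch.cross(origin, dirs_world, dim=-1)

    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)  # (6, H, W)
```

Such a 6-channel map, computed per frame along the camera trajectory, could then be downsampled to the transformer's patch resolution and injected through the ControlNet-like branch as the spatiotemporal camera conditioning described above.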