VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

20 Jul 2024 | Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
The paper introduces VD3D, a method for controlling camera poses in text-to-video generation using large video diffusion transformers. The method leverages a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. This approach enables fine-grained control over camera movement, which is crucial for applications in content creation, visual effects, and 3D vision. The authors evaluate their method on the RealEstate10K dataset and demonstrate state-of-the-art performance in controllable video generation. They also explore applications such as multi-view generation from real images, showcasing the potential of camera-conditioned image-to-multiview generation for complex 3D scene synthesis. The paper highlights limitations and future directions, including the need for improved control over dynamic scenes and the potential for further architectural improvements.
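The Plücker-coordinate camera embedding mentioned above assigns each pixel a 6D descriptor of its viewing ray: the ray direction d and the moment o × d, where o is the camera center. A minimal sketch of how such a per-pixel embedding can be computed from standard pinhole camera parameters is below; the function name and the exact normalization conventions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plucker coordinates (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (height, width, 6) array. Illustrative sketch only;
    conventions (pixel centers, normalization) are assumptions.
    """
    # Camera center in world coordinates: o = -R^T t.
    o = -R.T @ t
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project to ray directions in world coordinates, then normalize.
    dirs = pix @ np.linalg.inv(K).T @ R  # (H, W, 3)
    d = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Moment vector o x d completes the Plucker coordinates.
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)  # (H, W, 6)
```

Because the embedding varies per pixel and per frame, stacking it over time yields the spatiotemporal conditioning signal that the ControlNet-like branch consumes alongside the video tokens.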