4 Jun 2024 | Dejia Xu¹, Weili Nie², Chao Liu², Sifei Liu², Jan Kautz², Zhangyang Wang¹, Arash Vahdat²
CamCo is a camera-controllable image-to-video generation framework that produces 3D-consistent videos. It builds on a pre-trained image-to-video diffusion model and preserves most of the original parameters to retain the base model's generative capabilities. For fine-grained camera motion control, CamCo represents camera pose with Plücker coordinates, which encode both the camera intrinsics and extrinsics as a pixel-wise embedding. An epipolar constraint attention module further enforces epipolar constraints across frames, improving the geometric consistency of the generated videos. To handle in-the-wild videos with dynamic subjects, a data curation pipeline processes such footage, and CamCo is fine-tuned on the curated dataset to strengthen its ability to generate videos with both camera ego-motion and dynamic subjects. Experiments show that CamCo significantly improves 3D consistency and camera control over previous models while still generating plausible object motion. The framework generalizes to indoor, outdoor, object-centric, and text-to-image generated input images. The project page is available at https://ir1d.github.io/CamCo/.
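To make the pixel-wise pose conditioning concrete, below is a minimal sketch of how a Plücker embedding can be computed for every pixel of a pinhole camera. The function name and the NumPy formulation are illustrative assumptions, not the paper's code; it follows the standard Plücker ray parameterization (moment o × d, direction d), which yields a 6-channel map per frame of the kind the abstract describes.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plucker ray embedding (o x d, d) for a pinhole camera.

    Hypothetical helper for illustration. K: (3,3) intrinsics;
    R: (3,3), t: (3,) world-to-camera extrinsics (x_cam = R @ x_world + t).
    Returns an (H, W, 6) array.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Pixel grid sampled at pixel centers, in homogeneous coordinates
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project pixels to world-space ray directions: d = R^T K^-1 p
    d = pix @ np.linalg.inv(K).T @ R                        # (H, W, 3)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Plucker coordinates: moment m = o x d, plus the direction d
    m = np.cross(o, d)                                      # broadcasts o over all pixels
    return np.concatenate([m, d], axis=-1)                  # (H, W, 6)
```

Because every pixel gets its own 6-vector, this conditioning signal is spatially dense, which is what enables the fine-grained camera control the abstract claims, in contrast to conditioning on a single global pose vector per frame.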
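The epipolar constraint the attention module enforces can likewise be illustrated with standard two-view geometry: given the relative pose between two frames, the fundamental matrix maps each pixel to an epipolar line in the other frame, and cross-frame attention can be restricted to pixels near that line. The helper names and the hard-threshold mask below are assumptions for illustration; the paper's module may realize the constraint differently.

```python
import numpy as np

def fundamental_matrix(K, R, t):
    """Fundamental matrix mapping pixels in frame 1 to epipolar lines in
    frame 2, for relative pose x2 = R @ x1 + t (standard formula;
    illustrative, not the paper's parameterization)."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])                    # cross-product matrix [t]_x
    Kinv = np.linalg.inv(K)
    return Kinv.T @ tx @ R @ Kinv

def epipolar_mask(F, pix1, pix2, thresh=2.0):
    """Boolean (N1, N2) mask: True where homogeneous pixel j of frame 2
    lies within `thresh` pixels of the epipolar line of pixel i of frame 1."""
    lines = pix1 @ F.T                                     # (N1, 3): l_i = F p_i
    num = np.abs(lines @ pix2.T)                           # (N1, N2): |l_i . p_j|
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return num / den < thresh                              # point-to-line distance test

# Example: mask cross-frame attention between two 16x16 token grids
H = W = 16
ys, xs = np.mgrid[0:H, 0:W]
pix = np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(H * W)], axis=-1)
# F would come from the (known) conditioning poses of the two frames
# mask = epipolar_mask(F, pix, pix)   # (256, 256) attention mask
```

Masking (or reweighting) attention this way restricts each query token to the geometrically plausible matches in other frames, which is the mechanism behind the improved 3D consistency reported in the experiments.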