CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation


4 Jun 2024 | Sifei Liu, Dejia Xu, Weili Nie, Chao Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat

**Institution:** University of Texas at Austin, NVIDIA

**Abstract:** Video diffusion models have emerged as powerful tools for generating high-quality video content, but they often lack precise control over camera poses, which limits their expressive power and user control. To address this, CamCo introduces fine-grained camera pose control for image-to-video generation. It parameterizes camera poses with Plücker coordinates and integrates an epipolar attention module, enhancing 3D consistency in the generated videos. The model is fine-tuned on real-world videos with estimated camera poses, which improves object motion synthesis. Experiments show that CamCo significantly improves 3D consistency and camera control compared to previous models while still generating plausible object motion.

**Contributions:**
- Proposes CamCo, a camera-controllable image-to-video generation framework.
- Adapts a pre-trained image-to-video diffusion model to incorporate camera control.
- Introduces Plücker coordinates for precise camera pose representation.
- Enhances geometric consistency with an epipolar constraint attention module.
- Curates a dataset with annotated camera poses for better object motion generation.

**Methods:**
- **Image-to-Video Generation:** Builds on a pre-trained image-to-video diffusion model and injects camera information as an additional conditioning signal.
- **Camera Parameterization:** Uses per-pixel Plücker coordinates to encode both camera intrinsics and extrinsics (see the Plücker sketch below).
- **Epipolar Constraint Attention (ECA):** Enforces epipolar constraints across frames to ensure geometric consistency (see the epipolar-mask sketch below).
- **Data Curation:** Annotates in-the-wild video frames with estimated camera poses to improve object motion generation.
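
To make the camera parameterization concrete, here is a minimal numpy sketch of per-pixel Plücker coordinates. It is not the authors' released code; it assumes a standard pinhole model with world-to-camera extrinsics (x_cam = R x_world + t), and the function name is illustrative. Each pixel's ray is described by its normalized direction d and moment o × d, giving a 6-channel map that can serve as a camera-conditioning input.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker coordinates (o x d, d) for a pinhole camera.

    K: (3, 3) intrinsics. R, t: world-to-camera extrinsics, i.e.
    x_cam = R @ x_world + t. Returns an (H, W, 6) conditioning map.
    """
    # Camera center in world coordinates: o = -R^T t.
    o = -R.T @ t

    # Pixel centers (u, v, 1) in homogeneous pixel coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project to ray directions and rotate them into the world frame.
    d_cam = pix @ np.linalg.inv(K).T                            # (H, W, 3)
    d_world = d_cam @ R                                         # row-wise R^T d
    d_world /= np.linalg.norm(d_world, axis=-1, keepdims=True)

    # Plücker coordinates: moment m = o x d, followed by direction d.
    m = np.cross(np.broadcast_to(o, d_world.shape), d_world)
    return np.concatenate([m, d_world], axis=-1)                # (H, W, 6)
```

In a video setting, one such map would be computed per frame from that frame's extrinsics and stacked along the time axis before being fed to the conditioning branch.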
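
The epipolar constraint attention module can likewise be illustrated with a small sketch of the mask it relies on. This is a hedged illustration rather than the paper's implementation: it assumes world-to-camera extrinsics and shared intrinsics, builds the fundamental matrix between a query frame i and a key frame j, and keeps only key pixels that fall within a few pixels of each query pixel's epipolar line.

```python
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_attention_mask(K, R_i, t_i, R_j, t_j, H, W, thresh=2.0):
    """Boolean mask of shape (H*W, H*W): entry (q, k) is True when key
    pixel k of frame j lies within `thresh` pixels of the epipolar line
    of query pixel q of frame i. Extrinsics are world-to-camera.
    """
    # Relative pose from camera i to camera j, then the fundamental matrix.
    R_rel = R_j @ R_i.T
    t_rel = t_j - R_rel @ t_i
    F = np.linalg.inv(K).T @ (skew(t_rel) @ R_rel) @ np.linalg.inv(K)

    # Homogeneous pixel coordinates (same grid for both frames).
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)  # (3, N)

    # One epipolar line in frame j per query pixel of frame i.
    lines = F @ pix                                                 # (3, N)

    # Point-to-line distance for every (query, key) pair.
    num = np.abs(lines.T @ pix)                                     # (N, N)
    denom = np.linalg.norm(lines[:2], axis=0)[:, None] + 1e-8
    return (num / denom) < thresh
```

In an attention layer, positions where this mask is False would typically have their logits set to a large negative value before the softmax; in practice the mask is built on the downsampled latent grid, since a full-resolution (H·W)² mask would be prohibitively large.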

**Experiments:**
- **Baselines:** Compares against Stable Video Diffusion, VideoCrafter, and MotionCtrl.
- **Metrics:** Evaluates camera pose accuracy, visual quality (FID, FVD), and object motion; a common pose-error formulation is sketched at the end of this summary.
- **Results:** Demonstrates superior camera controllability, geometric consistency, and visual quality.

**Limitations and Future Work:**
- The current model generates videos with the same camera intrinsics as the input image.
- Future work will explore generating longer and higher-resolution videos.

**Broad Impacts:**
- The model may encode social biases, which could perpetuate stereotypes or misrepresentations in generated videos. A safety checker is provided to mitigate potential harm.
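
For the camera pose accuracy metric, a typical recipe is to re-estimate poses from the generated frames with a structure-from-motion tool and compare them against the conditioning poses. The paper describes its exact protocol; the helpers below are only an assumed, commonly used formulation of rotation error (geodesic distance) and scale-invariant translation error.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle between translation directions (scale-invariant), in degrees."""
    t_gt = t_gt / (np.linalg.norm(t_gt) + 1e-8)
    t_est = t_est / (np.linalg.norm(t_est) + 1e-8)
    return np.degrees(np.arccos(np.clip(float(t_gt @ t_est), -1.0, 1.0)))
```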