5 Jul 2024 | Basile Van Hoorick¹, Rundi Wu¹, Ege Ozguroglu¹, Kyle Sargent², Ruoshi Liu¹, Pavel Tokmakov³, Achal Dave³, Changxi Zheng¹, and Carl Vondrick¹
This paper introduces GCD (Generative Camera Dolly), a controllable monocular dynamic view synthesis pipeline that, given a single input video and a set of relative camera pose parameters, synthesizes the same scene from any chosen perspective. The model leverages large-scale video diffusion priors to perform end-to-end video-to-video translation without requiring depth input or explicit 3D scene geometry modeling. Concretely, a pre-trained video diffusion model is fine-tuned on paired synthetic videos of the same dynamic scene captured from different viewpoints.

The model is evaluated on two synthetic datasets, Kubric-4D and ParallelDomain-4D, where it demonstrates superior performance in both RGB and semantic space, outperforming existing methods in visual quality and consistency. It generates high-quality videos from novel viewpoints even under extreme camera transformations, and it can reveal previously unseen parts of dynamic scenes, including occluded objects and 'stuff' regions. Despite being trained only on synthetic multi-view video data, the model shows promising zero-shot generalization to real-world footage across multiple domains, including robotics, object permanence, and driving environments.

The paper also analyzes the choice of camera trajectory and the impact of different training configurations on performance. The framework enables applications in dynamic scene understanding, robotics perception, and interactive 3D video viewing for virtual reality, and it represents a notable advance in dynamic view synthesis.
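To make the described recipe concrete, the sketch below illustrates one plausible way to fine-tune a camera-pose-conditioned video diffusion model on paired synthetic clips (source video, target video, relative pose). It is only an illustration under stated assumptions, not the authors' implementation: PoseEmbedding, ToyVideoDenoiser, the 6-parameter per-frame pose format, and the linear noise schedule are all placeholders chosen for the example.

# Minimal sketch (not the authors' code) of camera-pose-conditioned video
# diffusion fine-tuning on paired synthetic clips. All names, shapes, and the
# noise schedule below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseEmbedding(nn.Module):
    """Maps a per-frame relative camera pose (here assumed to be 3 rotation +
    3 translation parameters) to a conditioning vector."""
    def __init__(self, pose_dim=6, embed_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim, embed_dim), nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, pose):            # pose: (B, T, pose_dim)
        return self.mlp(pose)           # (B, T, embed_dim)

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a pre-trained video diffusion backbone; the real model
    would be a large latent video U-Net initialized from pretraining."""
    def __init__(self, channels=3, embed_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(embed_dim, channels)
        self.net = nn.Sequential(
            nn.Conv3d(2 * channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, channels, 3, padding=1),
        )

    def forward(self, noisy_target, source_video, pose_emb):
        # Broadcast the pose embedding over space and add it as a per-frame bias.
        bias = self.cond_proj(pose_emb)                   # (B, T, C)
        bias = bias.permute(0, 2, 1)[..., None, None]     # (B, C, T, 1, 1)
        x = torch.cat([noisy_target + bias, source_video], dim=1)
        return self.net(x)                                # predicted noise

def finetune_step(denoiser, pose_embed, source, target, pose, opt):
    """One denoising training step: corrupt the target-view clip and train the
    model to predict the noise given the source clip and relative pose."""
    b = target.shape[0]
    t = torch.rand(b, device=target.device).view(b, 1, 1, 1, 1)  # noise level
    noise = torch.randn_like(target)
    noisy_target = (1 - t) * target + t * noise                  # linear schedule (assumption)
    pred = denoiser(noisy_target, source, pose_embed(pose))
    loss = F.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    B, T, C, H, W = 2, 8, 3, 32, 32
    denoiser, pose_embed = ToyVideoDenoiser(), PoseEmbedding()
    opt = torch.optim.Adam(
        list(denoiser.parameters()) + list(pose_embed.parameters()), lr=1e-4)
    source = torch.randn(B, C, T, H, W)   # input-view clip
    target = torch.randn(B, C, T, H, W)   # paired target-view clip (synthetic supervision)
    pose = torch.randn(B, T, 6)           # relative camera pose per frame
    print(finetune_step(denoiser, pose_embed, source, target, pose, opt))

At inference time, the same conditioning interface would let a user sweep the relative pose parameters to "dolly" the camera around the dynamic scene while the diffusion prior hallucinates the unobserved content.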