The paper "Diffusion Priors for Dynamic View Synthesis from Monocular Videos" addresses the challenge of dynamic novel view synthesis, particularly in scenarios where camera poses are unknown or constrained compared to object motion. The authors propose a method that leverages diffusion priors to hallucinate unseen regions and handle self-occlusions, out-of-view details, and complex motions more effectively than existing methods. They fine-tune a pre-trained RGB-D diffusion model on video frames and distill its knowledge into 4D representations, combining dynamic and static Neural Radiance Fields (NeRFs). The proposed pipeline achieves geometric consistency and scene identity preservation. Extensive experiments on the iPhone dataset demonstrate the robustness and utility of the method, showing superior performance in qualitative and quantitative evaluations, including user studies. The method is evaluated using metrics like mLPIPS and mSSIM, but the authors note that these metrics do not fully capture the perceived quality of the synthesized views. The paper also includes ablation studies to validate the effectiveness of the proposed components and discusses limitations and future directions, such as computational efficiency and extending the method to unbounded scenes.The paper "Diffusion Priors for Dynamic View Synthesis from Monocular Videos" addresses the challenge of dynamic novel view synthesis, particularly in scenarios where camera poses are unknown or constrained compared to object motion. The authors propose a method that leverages diffusion priors to hallucinate unseen regions and handle self-occlusions, out-of-view details, and complex motions more effectively than existing methods. They fine-tune a pre-trained RGB-D diffusion model on video frames and distill its knowledge into 4D representations, combining dynamic and static Neural Radiance Fields (NeRFs). The proposed pipeline achieves geometric consistency and scene identity preservation. Extensive experiments on the iPhone dataset demonstrate the robustness and utility of the method, showing superior performance in qualitative and quantitative evaluations, including user studies. The method is evaluated using metrics like mLPIPS and mSSIM, but the authors note that these metrics do not fully capture the perceived quality of the synthesized views. The paper also includes ablation studies to validate the effectiveness of the proposed components and discusses limitations and future directions, such as computational efficiency and extending the method to unbounded scenes.