Diffusion Priors for Dynamic View Synthesis from Monocular Videos

10 Jan 2024 | Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, Sergey Tulyakov
This paper introduces DpDy, a method for dynamic novel view synthesis from monocular videos. The approach leverages diffusion priors to address the challenges of reconstructing dynamic scenes, in particular self-occlusions, out-of-view details, and complex motions. The method fine-tunes a pre-trained RGB-D diffusion model on the video frames and distills its knowledge into a 4D representation that combines dynamic and static Neural Radiance Field (NeRF) components, which enforces geometric consistency while preserving the scene's identity. The method is evaluated on the iPhone dataset, demonstrating superior performance compared to existing approaches such as T-NeRF, NSFF, Nerfies, HyperNeRF, and RoDynRF. Qualitative and quantitative experiments show that DpDy produces high-quality results with fewer artifacts and better visual realism, and user studies further confirm its effectiveness in generating realistic dynamic views.
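The distillation step described above can be viewed as score distillation sampling (SDS) against the fine-tuned RGB-D diffusion prior. The following is a minimal, hypothetical PyTorch sketch of that idea; `diffusion_model`, its call signature, the timestep range, and the weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def sds_loss(diffusion_model, rendered_rgbd, cond_embedding, alphas_cumprod,
             t_min=20, t_max=980):
    """Hypothetical SDS step on a rendered RGB-D frame.

    `diffusion_model` is assumed to predict noise given a noisy RGB-D image,
    a timestep, and a conditioning embedding; the real prior, noise schedule,
    and conditioning used in the paper may differ.
    """
    b = rendered_rgbd.shape[0]
    # Sample a diffusion timestep and Gaussian noise.
    t = torch.randint(t_min, t_max, (b,), device=rendered_rgbd.device)
    noise = torch.randn_like(rendered_rgbd)
    alpha_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    # Forward-diffuse the rendered RGB-D frame.
    noisy = alpha_bar.sqrt() * rendered_rgbd + (1 - alpha_bar).sqrt() * noise

    # Predict the noise with the frozen, fine-tuned diffusion prior.
    with torch.no_grad():
        noise_pred = diffusion_model(noisy, t, cond_embedding)

    # SDS gradient: push the render toward regions the prior considers likely.
    w = 1 - alpha_bar
    grad = w * (noise_pred - noise)

    # Reparameterize the gradient as an MSE-style loss on the render.
    target = (rendered_rgbd - grad).detach()
    return 0.5 * F.mse_loss(rendered_rgbd, target, reduction="sum") / b
```

Backpropagating this loss updates only the 4D representation, since the diffusion prior is kept frozen inside the `no_grad` block.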
The method integrates RGB-D diffusion priors to guide novel view synthesis, providing direct geometry supervision and enabling more robust 4D scene reconstruction. It also incorporates a regularization loss that encourages a proper decomposition into dynamic foreground and static background, and a score distillation sampling (SDS) loss that improves the quality of novel views.

The method is implemented as a 4D representation with two separate NeRFs: one for rigid regions and one for dynamic regions. Rendering blends the outputs of the two NeRFs, and reconstruction losses on images and depth maps minimize the differences between the rendered frames and the reference video, while the SDS loss provides guidance for novel dynamic view synthesis (sketched in code below).

Training and evaluation use the iPhone dataset, which contains 14 videos with diverse and complex motions. The results show that DpDy outperforms existing methods in visual quality and realism, particularly in dynamic foregrounds, and that it handles occluded regions and complex object motions more realistically and consistently than the baselines.

The method has limitations, including the need for high-end GPUs for training and the current resolution constraints; future work could explore more efficient representations and lighter diffusion models. It is also limited to bounded dynamic scenes, and extending it to unbounded scenes could be achieved through progressive grid combination or image-conditioned rendering.
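To make the two-branch rendering and training objective above concrete, here is a minimal sketch under stated assumptions: each NeRF branch returns per-sample densities and colors, the branches are composited by density-weighted blending, and the overall objective sums image and depth reconstruction, the SDS term, and a decomposition regularizer. The function names, blending rule, and loss weights are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def blend_renders(static_out, dynamic_out):
    """Composite static and dynamic branches along a ray (illustrative only).

    Each *_out dict is assumed to hold `sigma` with shape (..., 1) and `rgb`
    with shape (..., 3) from its NeRF; the paper's exact blending and field
    parameterization may differ.
    """
    # Densities add; colors are density-weighted -- a common two-branch scheme.
    sigma = static_out["sigma"] + dynamic_out["sigma"]
    rgb = (static_out["sigma"] * static_out["rgb"]
           + dynamic_out["sigma"] * dynamic_out["rgb"]) / sigma.clamp(min=1e-6)
    return sigma, rgb

def total_loss(render, frame, depth_ref, sds_term, dyn_weight,
               w_depth=0.1, w_sds=0.01, w_reg=0.001):
    """Hypothetical training objective combining the terms described above.

    The loss weights are placeholders, not values reported in the paper.
    """
    l_rgb = F.mse_loss(render["rgb"], frame)          # image reconstruction
    l_depth = F.l1_loss(render["depth"], depth_ref)   # depth reconstruction
    l_reg = dyn_weight.abs().mean()                   # keep background in the static NeRF
    return l_rgb + w_depth * l_depth + w_sds * sds_term + w_reg * l_reg
```

In this sketch, reconstruction terms are computed on views matching the input video, while the SDS term (from the earlier sketch) is evaluated on renders from sampled novel viewpoints.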