17 May 2025 | Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang
MoDGS is a novel method for rendering novel views of dynamic scenes from casually captured monocular videos. Such videos, shot with static or slowly moving cameras, are difficult for existing methods because they provide only weak multiview consistency. MoDGS instead leverages recent single-view depth estimation techniques to guide the learning of the dynamic scene, and it introduces two key innovations: a 3D-aware initialization scheme for the deformation field and a novel, robust ordinal depth loss that supervises the scene geometry.

The 3D-aware initialization scheme gives the deformation field a geometrically reasonable starting point and thus a strong basis for the subsequent optimization; without it, learning a reasonable deformation field from a single monocular video is difficult.
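The summary does not spell out how the 3D-aware initialization works. One plausible reading, sketched below in PyTorch, is to back-project each frame's estimated depth map into a 3D point cloud using the camera intrinsics, pair points across frames (for example via 2D optical flow), and pretrain the deformation field to reproduce the resulting 3D correspondences before joint optimization. The `backproject` helper, the `DeformMLP` architecture, and the L1 pretraining objective are illustrative assumptions, not MoDGS's exact pipeline.

```python
import torch
import torch.nn as nn

def backproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Lift an (H, W) depth map to camera-space 3D points via pinhole intrinsics K."""
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return torch.stack([x, y, depth], dim=-1).reshape(-1, 3)  # (H*W, 3)

class DeformMLP(nn.Module):
    """Toy deformation field: maps a (3D point, time) pair to a 3D offset."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, pts: torch.Tensor, t: float) -> torch.Tensor:
        tt = torch.full((pts.shape[0], 1), t, dtype=pts.dtype, device=pts.device)
        return self.net(torch.cat([pts, tt], dim=-1))

def pretrain_step(deform, opt, src_pts, dst_pts, t):
    """One initialization step: warp source points to time t and match their
    (assumed given) 3D correspondences with an L1 flow loss."""
    pred = src_pts + deform(src_pts, t)
    loss = (pred - dst_pts).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In a real pipeline, `src_pts` and `dst_pts` would come from back-projecting two frames and pairing pixels with 2D optical flow; after this warm start, the deformation field and the Gaussians would be optimized jointly against the rendering losses.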
The ordinal depth loss targets a key weakness of single-view depth estimators: their outputs are not consistent in scale and shift from frame to frame, so penalizing absolute depth differences is unreliable. Instead, the loss only requires that the ordering of depth values be correct, which yields consistent depth supervision and enables accurate reconstruction of the dynamic scene geometry; a minimal sketch of such a loss is given at the end of this summary.

MoDGS is evaluated on three widely used datasets (Nvidia, DyNeRF, and DAVIS) as well as a self-collected set of in-the-wild monocular videos. Compared with state-of-the-art baselines, both NeRF-based and Gaussian-splatting-based, it achieves superior PSNR, SSIM, and LPIPS, and comprehensive experiments show that it renders higher-quality novel view images from casually captured monocular videos, demonstrating its effectiveness in real-world scenarios.

The method has limitations: it cannot reconstruct parts of the scene that are never observed, it relies on the single-view depth estimator producing accurate depth maps, and it struggles with rapidly moving scenes, heavy specular reflections, and low-light conditions. Within these bounds, however, it reconstructs the visible parts of a scene well and outperforms existing methods in both quality and overall performance on casually captured monocular videos.
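As promised above, here is a minimal sketch of a pairwise ordinal depth loss. The summary only states that the loss enforces the correct ordering of depth values, so the concrete formulation below (random pixel-pair sampling plus a hinge on the sign of the depth difference) is an assumed instantiation of that idea, not necessarily MoDGS's exact definition.

```python
import torch

def ordinal_depth_loss(rendered: torch.Tensor,
                       estimated: torch.Tensor,
                       num_pairs: int = 4096,
                       margin: float = 1e-4) -> torch.Tensor:
    """Pairwise ordinal loss over sampled pixel pairs.

    rendered, estimated: (N,) depth values for the same pixels. Only the
    *sign* of the estimated depth differences is trusted, which makes the
    loss robust to the per-frame scale/shift ambiguity of monocular depth.
    """
    n = rendered.shape[0]
    i = torch.randint(0, n, (num_pairs,))
    j = torch.randint(0, n, (num_pairs,))
    # +1 if pixel i is estimated to be farther than pixel j, -1 if nearer.
    sign = torch.sign(estimated[i] - estimated[j])
    # Hinge: the rendered depth difference must agree with that sign.
    return torch.relu(margin - sign * (rendered[i] - rendered[j])).mean()
```

During training, `rendered` would be the depth rendered from the Gaussians at sampled pixels and `estimated` the single-view network's prediction at the same pixels; because only the ordering of the estimates is used, their per-frame scale and shift cancel out of the supervision.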