29 Nov 2024 | Jiahui Lei¹ Yijia Weng² Adam W. Harley² Leonidas Guibas² Kostas Daniilidis¹,³
MoSca is a 4D reconstruction system that reconstructs dynamic scenes from monocular videos captured in the wild and synthesizes novel views of them. It leverages prior knowledge from foundational vision models and represents motion with a novel Motion Scaffold (MoSca) representation that compactly encodes the underlying motions and deformations. Scene geometry and appearance are disentangled from the deformation field and encoded by globally fused Gaussians. Camera focal length and poses are recovered via bundle adjustment, without requiring additional pose estimation tools.

The system reconstructs 4D scenes by lifting the outputs of 2D foundational models to 3D, optimizing the deformation field with physics-inspired regularization, and fusing observations across the input video; camera poses and focal length are further refined through photometric optimization. MoSca's structured deformation representation and efficient Gaussian-based dynamic scene representation enable global fusion of observations and rendering at any query time. The pipeline is fully automated and operates on pose-free monocular videos, demonstrating robust performance on challenging dynamic scene reconstruction and novel view synthesis tasks.

MoSca achieves state-of-the-art results on dynamic rendering benchmarks, including DyCheck and NVIDIA, with high PSNR and competitive LPIPS, and works effectively on real videos. It also provides accurate camera pose estimation and correspondence tracking, and ablation studies confirm the importance of the geometric and photometric optimization stages. In real-world 4D reconstruction, MoSca enables applications such as removing moving foregrounds and occluders and editing 4D videos. Its main limitation is a reliance on accurate 2D tracks and depth estimation; future work may incorporate large-scale priors and model lighting effects.
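The core scaffold idea, a sparse graph of node trajectories that drives a dense deformation field, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the paper's exact formulation: points are warped by distance-weighted blending of the rigid motions of their k nearest scaffold nodes, and the "physics-inspired regularization" is stood in for by a crude edge-length-preservation (ARAP-style) penalty between neighboring nodes.

```python
import numpy as np

def warp_point(x, node_pos_t0, node_R, node_t, sigma=0.1, k=4):
    """Warp point x from the reference time to a target time by skinning
    over the k nearest scaffold nodes, each carrying a rigid transform
    (node_R[i], node_t[i]). Gaussian weights and k-NN blending are
    illustrative choices, not the paper's."""
    d = np.linalg.norm(node_pos_t0 - x, axis=1)      # distances to all nodes
    nn = np.argsort(d)[:k]                           # k nearest scaffold nodes
    w = np.exp(-d[nn] ** 2 / (2 * sigma ** 2))       # distance-based weights
    w /= w.sum()
    # Blend the rigidly transformed copies of x (linear blend skinning).
    return sum(
        wi * (node_R[i] @ (x - node_pos_t0[i]) + node_pos_t0[i] + node_t[i])
        for wi, i in zip(w, nn)
    )

def arap_residual(node_pos_t0, node_pos_t1, edges):
    """Penalize change in scaffold edge length between two frames -- a crude
    stand-in for the physics-inspired deformation regularization."""
    res = 0.0
    for i, j in edges:
        l0 = np.linalg.norm(node_pos_t0[i] - node_pos_t0[j])
        l1 = np.linalg.norm(node_pos_t1[i] - node_pos_t1[j])
        res += (l1 - l0) ** 2
    return res
```

Because the scaffold is sparse, only the node trajectories need to be optimized and regularized; the dense set of Gaussians inherits its motion through the skinning weights, which is what makes the representation compact.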
Overall, MoSca represents a significant step forward in reconstructing and rendering dynamic scenes from monocular videos.
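The pose-free bundle adjustment mentioned above can be sketched on a toy problem. This is not the paper's solver (which also uses photometric terms); it is a reprojection-only sketch in which a shared focal length and per-frame camera translations are recovered by Gauss-Newton from tracked points, with rotations fixed to identity for brevity. All function names and the synthetic setup are assumptions for illustration.

```python
import numpy as np

def project(points, focal, trans):
    """Pinhole projection of Nx3 world points; camera rotation fixed to I."""
    cam = points + trans                        # world -> camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]     # perspective divide

def residuals(params, points, tracks_2d):
    """Stack reprojection errors over all frames for one focal length and
    one translation per frame."""
    focal = params[0]
    trans = params[1:].reshape(-1, 3)
    res = [project(points, focal, trans[f]) - tracks_2d[f]
           for f in range(len(trans))]
    return np.concatenate(res).ravel()

def numeric_jac(fun, x, eps=1e-6):
    """Finite-difference Jacobian, adequate for a toy-sized problem."""
    f0 = fun(x)
    J = np.zeros((f0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (fun(xp) - f0) / eps
    return J

def gauss_newton(fun, x, iters=20):
    """Undamped Gauss-Newton: repeatedly solve the linearized least squares."""
    for _ in range(iters):
        J = numeric_jac(fun, x)
        x = x - np.linalg.lstsq(J, fun(x), rcond=None)[0]
    return x

# Synthetic check: generate tracks from known parameters, then recover them
# from a perturbed initialization.
rng = np.random.default_rng(0)
points = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 6.0], size=(20, 3))
true_focal = 500.0
true_trans = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
tracks = np.stack([project(points, true_focal, t) for t in true_trans])
x0 = np.concatenate([[400.0], (true_trans + 0.1).ravel()])
sol = gauss_newton(lambda p: residuals(p, points, tracks), x0)
```

The point of the sketch is that, given reliable tracks (and, in MoSca's case, depth from foundation models), intrinsics and extrinsics fall out of a standard least-squares objective, so no external pose estimation tool is needed.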