23 May 2024 | Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki
DreamScene4D is a novel approach for generating dynamic 3D scenes from monocular videos, enabling novel-view synthesis for multiple objects with fast motion. It introduces a "decompose-recompose" strategy: the video is first decomposed into background and object tracks, and each object's motion is further factorized into three components, namely object-centric deformation, an object-to-world-frame transformation, and camera motion. This factorization lets rendering-error gradients and object view-predictive models recover 3D object completions and deformations, while bounding-box tracks guide large object displacements. DreamScene4D achieves significant improvements over existing state-of-the-art video-to-4D generation approaches on challenging data such as DAVIS, Kubric, and self-captured videos, and it also produces accurate, persistent 2D point tracks by projecting the inferred 3D trajectories back to 2D.
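As a rough illustration of how the factored motion components and the projected point tracks fit together, here is a minimal numpy sketch. The function names, the rigid-transform parameterization, and the pinhole intrinsics are assumptions for illustration only, not the paper's actual implementation or API.

```python
import numpy as np

def compose_object_motion(centers, deformation, R_obj2world, t_obj2world,
                          R_world2cam, t_world2cam):
    """Recompose the three factored motion components for one object at one frame.

    centers:      (N, 3) canonical Gaussian centers in the object-centric frame
    deformation:  (N, 3) per-Gaussian object-centric deformation at this frame
    R_*, t_*:     rigid transforms (3x3 rotation, 3-vector translation)
    Returns (N, 3) points in the camera frame.
    """
    deformed = centers + deformation                    # 1) object-centric deformation
    in_world = deformed @ R_obj2world.T + t_obj2world   # 2) object-to-world transform
    in_cam = in_world @ R_world2cam.T + t_world2cam     # 3) camera motion
    return in_cam

def project_tracks(traj_cam, K):
    """Project (T, N, 3) camera-frame trajectories to (T, N, 2) pixel tracks
    with a pinhole camera of intrinsics K (3, 3)."""
    z = np.clip(traj_cam[..., 2:3], 1e-6, None)   # depth along the optical axis
    uv = (traj_cam / z) @ K.T                      # normalize, then apply intrinsics
    return uv[..., :2]
```

Stacking `compose_object_motion` over frames yields 3D trajectories whose projection with `project_tracks` gives the kind of persistent 2D point tracks described above.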
The method is evaluated with quantitative comparisons and a user preference study, demonstrating its effectiveness at generating realistic 4D scene representations. It is implemented with Gaussian Splatting and leverages powerful foundation models to generalize to diverse zero-shot settings, handling complex multi-object scenes with large object motions while producing temporally consistent 4D results. Evaluations on self-captured videos with fast object motion show robustness under challenging conditions, and DreamScene4D outperforms existing methods in both motion accuracy and 4D scene generation quality: where prior approaches yield distorted 3D geometry, blurring, or broken artifacts for fast-moving objects, DreamScene4D produces consistent and faithful renders. It can also generate motion trajectories in arbitrary camera views and recovers accurate 3D point motion in the visible reference view as well as robust motion tracks in synthesized novel views, highlighting its applicability to real-world, complex videos.

The approach has limitations. It does not generalize well to videos captured from cameras at steep elevation angles, and it may fall into local optima when the rendered depth of the lifted 3D objects is poorly aligned with the estimated depth. Heavy occlusions leave the optimization under-constrained and can introduce artifacts, and runtime scales linearly with the number of objects, which can be slow for complex videos.

This work was supported by the Toyota Research Institute.