GFlow: Recovering 4D World from Monocular Video


28 May 2024 | Shizun Wang, Xingyi Yang, Qiuqiong Shen, Zhenxiang Jiang, Xinchao Wang
GFlow is a framework for reconstructing a 4D world from a single monocular video without requiring known camera parameters. It uses 2D priors (monocular depth and optical flow) to lift the video into an explicit 4D representation: a flow of Gaussian splatting evolving through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that refines camera poses and 3D Gaussian points against the 2D priors and the scene clustering, enforcing fidelity among neighboring points and smooth movement across frames.

Beyond reconstruction, GFlow enables tracking of points across frames, segmentation of moving objects, and rendering of novel views, and it supports both scene-level and object-level editing. Evaluated on the DAVIS and Tanks and Temples datasets, it delivers high-quality video reconstruction and accurate camera pose estimation, outperforming existing methods on both, while handling dynamic scenes with complex spatial relationships and temporal coherence. Its main limitations are its reliance on off-the-shelf depth and optical-flow models, which may introduce inaccuracies, and its use of K-Means for scene clustering, which may be inadequate for complex scenarios. Overall, GFlow offers a powerful and flexible approach to 4D reconstruction from monocular video.
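The still/moving clustering step can be illustrated with a minimal sketch: a 2-cluster K-Means over per-pixel flow magnitudes, splitting pixels whose motion is not explained by the camera from the static background. This is an assumption-laden toy, not the paper's exact formulation; the `flow_residual` input and the 1-D two-means routine are illustrative stand-ins.

```python
import numpy as np

def kmeans_1d(values, iters=20):
    """2-means on a 1-D array; returns a boolean mask for the high-magnitude cluster."""
    c = np.array([values.min(), values.max()], dtype=float)  # init centers at extremes
    for _ in range(iters):
        # Assign each value to its nearest center, then recompute centers.
        assign = np.abs(values[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                c[k] = values[assign == k].mean()
    return assign == c.argmax()  # True -> "moving" cluster

def split_still_moving(flow_residual):
    """flow_residual: (H, W) flow magnitude unexplained by camera motion (assumed given)."""
    moving = kmeans_1d(flow_residual.ravel()).reshape(flow_residual.shape)
    return ~moving, moving  # (still_mask, moving_mask)

# Toy example: a near-static background with one small fast-moving patch.
res = np.full((8, 8), 0.1)
res[2:4, 2:4] = 3.0
still, moving = split_still_moving(res)
```

In GFlow the resulting masks gate the optimization: still points constrain the camera pose update, while moving points are free to follow the optical flow.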