6 Jun 2024 | Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi
Flash3D is a method for reconstructing 3D scenes and generating novel views from a single image, designed for both strong generalization and efficiency. It builds on a pre-trained monocular depth estimator and extends it into a full reconstructor of 3D shape and appearance. Reconstruction is performed with feed-forward Gaussian splatting: the network predicts multiple layers of Gaussians per image, which lets it handle occlusions and truncations and recover occluded regions that depth estimation alone cannot. Building on a pre-trained depth network (a foundation model) gives the method strong shape and appearance priors and good generalization, while the multi-layer Gaussian representation keeps reconstruction efficient. The resulting pipeline is simple, renders high-quality images of the reconstructed scenes, and handles both indoor and outdoor environments.

Flash3D trains on a single GPU in about 16 hours, which puts it within reach of a wide range of researchers. It achieves state-of-the-art novel view synthesis results on RealEstate10k across various metrics. The method is evaluated on multiple datasets and performs strongly both in-domain and in cross-domain generalization: on datasets not seen during training, such as NYU and KITTI, it outperforms competing methods, and on KITTI it even surpasses methods trained specifically on that dataset. Flash3D is also compared to few-view methods, where it proves effective in extrapolation tasks. Overall, Flash3D is a simple and effective approach to monocular scene reconstruction.
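The multi-layer idea can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics fx, fy, cx, cy and places the first layer of Gaussians by back-projecting each pixel at its monocular depth; every additional layer is pushed further along the same pixel ray by a positive depth offset, so it can cover surfaces hidden behind the visible one. The function names, the softplus constraint on the offsets, and the nested-list layout are illustrative assumptions, not the authors' implementation, and the sketch computes only Gaussian centres (no opacities, covariances, or colours).

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def softplus(x: float) -> float:
    """Numerically stable softplus; used here (an assumption) to keep depth offsets positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def unproject(u: float, v: float, z: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Back-project pixel (u, v) at depth z through a pinhole camera."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def layered_gaussian_means(depth: List[List[float]],
                           raw_offsets: List[List[List[float]]],
                           fx: float, fy: float, cx: float, cy: float
                           ) -> List[List[Point3]]:
    """Return one row-major list of 3D Gaussian centres per layer (hypothetical helper).

    depth[v][u]          : monocular depth of pixel (u, v), used for layer 0
    raw_offsets[k][v][u] : hypothetical raw network output for layer k + 1;
                           softplus makes each increment positive, so every
                           extra layer lies strictly behind the previous one.
    """
    h, w = len(depth), len(depth[0])
    current = [row[:] for row in depth]
    # Layer 0: one Gaussian per pixel at the predicted depth.
    layers = [[unproject(u, v, current[v][u], fx, fy, cx, cy)
               for v in range(h) for u in range(w)]]
    # Extra layers: accumulate positive offsets along each pixel's ray.
    for offsets in raw_offsets:
        current = [[current[v][u] + softplus(offsets[v][u]) for u in range(w)]
                   for v in range(h)]
        layers.append([unproject(u, v, current[v][u], fx, fy, cx, cy)
                       for v in range(h) for u in range(w)])
    return layers


# Usage: a 2x2 depth map with one extra layer of hypothetical raw offsets.
depth = [[2.0, 2.0], [3.0, 3.0]]
raw = [[[0.0, 1.0], [-1.0, 2.0]]]
means = layered_gaussian_means(depth, raw, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Constraining the increments to be positive keeps the layers ordered in depth along each ray, so an extra layer cannot land in front of the visible surface; in a full model the network would also predict the remaining Gaussian parameters for every layer before splatting.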