Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image


6 Jun 2024 | Stanisław Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi
**Abstract:** This paper introduces Flash3D, a method for scene reconstruction and novel view synthesis from a single image that is both highly generalisable and efficient. Flash3D starts from a 'foundation' model for monocular depth estimation and extends it to a full 3D shape and appearance reconstructor. The method is based on feed-forward Gaussian Splatting: it predicts a first layer of 3D Gaussians at the estimated depth and then adds further layers of Gaussians offset in space, allowing the model to reconstruct occluded and truncated parts of the scene. Flash3D can be trained on a single GPU in a day, making it accessible to most researchers. It achieves state-of-the-art results on the RealEstate10k dataset and outperforms competitors on unseen datasets such as NYU and KITTI, even surpassing methods trained specifically for those datasets.

**Introduction:** Reconstructing photorealistic 3D scenes from a single image is challenging due to the lack of geometric cues and the complexity of real scenes. Monocular depth estimation, a mature area, provides accurate metric depth but offers no appearance information and cannot handle occluded regions. Flash3D addresses these limitations by building on a high-quality monocular depth predictor and using feed-forward Gaussian Splatting to model the scene's geometry and appearance.

**Method:** Flash3D uses a pre-trained monocular depth predictor to estimate metric depth. An additional network then predicts, for each pixel, shape and appearance parameters for multiple layers of Gaussians, allowing the model to represent occluded and out-of-frame regions. The resulting pipeline is efficient and can be trained on a single GPU in a day.
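To make the layered prediction concrete, here is a minimal sketch of how a per-pixel, multi-layer Gaussian head could look. The class name, parameter layout, and the choice of two layers are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayeredGaussianHead(nn.Module):
    """Hypothetical sketch: predicts K layers of per-pixel Gaussian
    parameters from image features, given a monocular depth map.
    Layer 1 sits at the predicted depth; later layers add a
    non-negative depth offset so they can model occluded surfaces."""

    def __init__(self, feat_dim: int, num_layers: int = 2):
        super().__init__()
        self.num_layers = num_layers
        # per layer: depth offset (1) + opacity (1) + scale (3)
        # + rotation quaternion (4) + RGB colour (3) = 12 channels
        self.head = nn.Conv2d(feat_dim, num_layers * 12, kernel_size=1)

    def forward(self, feats: torch.Tensor, depth: torch.Tensor):
        # feats: (B, feat_dim, H, W), depth: (B, 1, H, W)
        B, _, H, W = feats.shape
        params = self.head(feats).view(B, self.num_layers, 12, H, W)
        gaussians = []
        for k in range(self.num_layers):
            p = params[:, k]
            # the first layer stays at the predicted depth; deeper
            # layers are pushed behind it by a non-negative offset
            offset = F.relu(p[:, 0:1]) if k > 0 else torch.zeros_like(depth)
            gaussians.append({
                "depth": depth + offset,
                "opacity": torch.sigmoid(p[:, 1:2]),
                "scale": torch.exp(p[:, 2:5]),           # positive scales
                "rotation": F.normalize(p[:, 5:9], dim=1),  # unit quaternion
                "rgb": torch.sigmoid(p[:, 9:12]),
            })
        return gaussians
```

Each layer's depth map would then be unprojected into 3D Gaussian means using the camera intrinsics before rendering with a standard Gaussian Splatting rasteriser.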
**Experiments:** Flash3D is evaluated on RealEstate10k, NYU, and KITTI. It demonstrates superior cross-dataset generalisation and strong in-domain novel view synthesis, and it even outperforms some two-view methods. Ablation studies and qualitative analyses show the contribution of each component of the method.

**Conclusion:** Flash3D is a highly efficient and generalisable approach to monocular scene reconstruction, achieving state-of-the-art results and demonstrating strong performance across a range of datasets.