20 Feb 2024 | Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson
The paper introduces the first diffusion model capable of generating and reconstructing large-scale, detailed 3D scenes from real-world images. The authors address three key challenges: the lack of expressive 3D scene representations, the scarcity of real-world 3D datasets, and the difficulty of sampling from the true posterior distribution over complex scenes. To overcome these challenges, they propose a new neural scene representation called IB-planes, which can efficiently and accurately represent large 3D scenes by dynamically allocating more capacity as needed. They also develop a denoising-diffusion framework that learns a prior over this novel 3D scene representation using only 2D images, without additional supervision. Additionally, they present a principled approach to avoid trivial 3D solutions by dropping out the representations of some images. The model is evaluated on several challenging datasets and demonstrates superior results in generation, novel view synthesis, and 3D reconstruction.
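To make the dropout idea concrete, here is a minimal sketch of how excluding a view's own representation when rendering that view could look in code. This is not the authors' implementation: the names `per_view_features`, `render_fn`, and `drop_prob` are hypothetical, and the pooling of per-view feature planes is left abstract; it only illustrates why the model cannot trivially copy its noisy input image and is instead forced to infer a consistent 3D scene from the other views.

```python
import torch

def render_views_with_dropout(per_view_features, render_fn, drop_prob=1.0):
    """Hypothetical sketch of per-view representation dropout.

    per_view_features: list of N feature tensors, one per input view.
    render_fn(features, i): renders view i from the given feature planes.
    drop_prob: probability of excluding view i's own plane when rendering view i.
    """
    num_views = len(per_view_features)
    rendered = []
    for i in range(num_views):
        # Keep every plane except (with probability drop_prob) the one
        # computed from view i itself, so view i must be explained by
        # the 3D scene inferred from the remaining views.
        keep = [f for j, f in enumerate(per_view_features)
                if j != i or torch.rand(()).item() > drop_prob]
        if not keep:
            # Guard: always leave at least one plane to render from.
            keep = [per_view_features[(i + 1) % num_views]]
        rendered.append(render_fn(keep, i))
    return rendered
```

In a training loop, the rendered views would then be compared against the clean target images; because the rendering of each view never sees that view's own (noisy) features, the only way to minimise the loss is to place genuine 3D structure in the shared representation.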