2024 | Titus Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson
This paper introduces a novel denoising diffusion model, GIBR (Generative Image-Based Rendering), for fast and detailed reconstruction and generation of real-world 3D scenes. The model addresses three key challenges in 3D scene generation: (1) representing large or unbounded scenes with a neural scene representation that can dynamically allocate capacity to capture the details visible in each image; (2) learning a prior over this novel 3D scene representation using only 2D images, without additional supervision; and (3) avoiding trivial 3D solutions when integrating image-based rendering with the diffusion model, by dropping out the representations of some images.

The model uses a new neural scene representation called IB-planes, which incorporates information from multiple images and adds depth and polar features. It also proposes a joint multi-view denoising framework that supports both unconditional generation and reconstruction of 3D scenes from varying numbers of images.

The model is evaluated on several challenging datasets of real and synthetic images, demonstrating superior results in generation, novel view synthesis, and 3D reconstruction. It outputs explicit representations of 3D scenes that can be rendered at resolutions up to 1024×1024. The paper also discusses related work, including traditional 3D reconstruction methods, neural fields, and generative models such as GANs, VAEs, and diffusion models.
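As a rough illustration of the image-based rendering idea described above (sampling each input view's feature plane at the projection of a 3D point, appending depth, and pooling across views), here is a minimal PyTorch sketch. The function names, the pinhole projection, the mean pooling, and the appended depth feature are simplifying assumptions for illustration, not the paper's exact IB-planes architecture.

```python
# Minimal sketch (not the paper's exact IB-planes model) of image-based feature
# aggregation: project a 3D point into each input view, sample that view's feature
# plane, append the depth, and pool across views. All names and shapes are assumptions.
import torch
import torch.nn.functional as F


def project(points_world, K, w2c):
    """Project world-space points (N, 3) through one pinhole camera.

    Returns pixel coordinates (N, 2) and depth (N, 1)."""
    ones = torch.ones(points_world.shape[0], 1)
    cam = (w2c @ torch.cat([points_world, ones], dim=1).T).T[:, :3]  # camera-space points
    uvz = (K @ cam.T).T
    depth = uvz[:, 2:3].clamp(min=1e-6)
    return uvz[:, :2] / depth, depth


def aggregate_features(points_world, feat_maps, Ks, w2cs, image_size):
    """Pool per-view features sampled at the projections of 3D points.

    feat_maps: (V, C, H, W), one feature plane per input image.
    Returns (N, C + 1): pooled features with the sampled depth appended."""
    per_view = []
    for v in range(feat_maps.shape[0]):
        uv, depth = project(points_world, Ks[v], w2cs[v])
        # Normalise pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / (image_size[1] - 1),
                            uv[:, 1] / (image_size[0] - 1)], dim=-1)
        grid = (grid * 2 - 1).view(1, -1, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, N, 1)
        per_view.append(torch.cat([sampled.squeeze(0).squeeze(-1).T, depth], dim=-1))
    # Mean pooling over views; a learned, order-invariant aggregation would be used in practice.
    return torch.stack(per_view, dim=0).mean(dim=0)


if __name__ == "__main__":
    pts = torch.randn(128, 3)                       # sample points along camera rays
    feats = torch.randn(2, 16, 32, 32)              # feature planes for 2 input views
    K = torch.tensor([[30.0, 0, 16], [0, 30.0, 16], [0, 0, 1]])
    pooled = aggregate_features(pts, feats, K.expand(2, 3, 3),
                                torch.eye(4).expand(2, 4, 4), (32, 32))
    print(pooled.shape)                             # torch.Size([128, 17])
```

Pooling across views with an order-invariant operation is what allows such a representation to accept a varying number of input images, matching the varying-view reconstruction setting described above.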
The authors compare their method to other diffusion-based approaches and show that it outperforms existing methods on PSNR, SSIM, LPIPS, and DRC. They also evaluate unconditional generation of 3D scenes, where the model improves significantly over baselines on Fréchet Inception Distance (FID). The paper concludes that the proposed method provides a powerful new approach to 3D scene generation and reconstruction, capable of being trained from multi-view images without 3D supervision.
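The joint multi-view denoising with dropped-out per-view representations mentioned above can also be sketched as a toy training step. Everything below (the stand-in encoder/decoder, the scalar alpha_bar, the MSE loss against clean views) is a placeholder that ignores camera geometry entirely; it only illustrates the pattern of corrupting all views, building one representation per view, and masking out a view's own representation whenever that view is rendered, so the network cannot simply copy its noisy input back out.

```python
# Toy sketch of one joint multi-view denoising step with per-view representation
# dropout. The encoder/decoder, the scalar alpha_bar, and the x0-style MSE loss are
# placeholders; camera geometry and the volumetric renderer are omitted entirely.
import torch


class ToyMultiViewDenoiser(torch.nn.Module):
    """Stand-in network: encodes each noisy view to a feature map and 'renders' a view
    by decoding the mean of the kept per-view feature maps."""

    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.enc = torch.nn.Conv2d(channels, hidden, 3, padding=1)
        self.dec = torch.nn.Conv2d(hidden, channels, 3, padding=1)

    def encode(self, views):             # (V, 3, H, W) -> (V, hidden, H, W)
        return self.enc(views)

    def render(self, kept_features):     # (K, hidden, H, W) -> (3, H, W)
        return self.dec(kept_features.mean(dim=0, keepdim=True)).squeeze(0)


def denoising_step(model, images, alpha_bar, dropout_p=0.5):
    """Joint denoising loss over all V views of one scene, with dropout of the
    rendered view's own representation to prevent trivial copying."""
    V = images.shape[0]
    noise = torch.randn_like(images)
    noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise  # corrupt every view
    features = model.encode(noisy)                                      # one representation per view

    losses = []
    for v in range(V):
        # Never keep view v's own representation; randomly drop some of the others too.
        keep = [u for u in range(V) if u != v and torch.rand(()).item() > dropout_p]
        keep = keep or [(v + 1) % V]                                    # always keep at least one view
        rendered = model.render(features[keep])
        losses.append(torch.mean((rendered - images[v]) ** 2))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    model = ToyMultiViewDenoiser()
    views = torch.rand(4, 3, 16, 16)                                    # 4 views of one scene
    loss = denoising_step(model, views, alpha_bar=torch.tensor(0.7))
    loss.backward()
    print(float(loss))
```

In the method summarised above, the rendering step is an image-based volumetric renderer over IB-planes rather than this toy decoder, and the dropout of per-view representations is what pushes the denoiser to rely on genuine cross-view 3D structure.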