InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes

10 Jan 2024 | Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari
InseRF is a novel method for generating and inserting objects into neural 3D scenes represented by neural radiance fields (NeRFs). The method addresses the challenge of inserting objects at arbitrary locations within 3D scenes, which existing methods often fail to achieve due to the multiview inconsistency of their edits. InseRF requires only a user-provided textual description and a 2D bounding box in a reference viewpoint to generate and insert new objects. The process involves several key steps:

1. **2D View Generation**: A 2D view of the target object is generated in the reference viewpoint using a text-to-image diffusion model conditioned on the provided bounding box.
2. **3D Object Reconstruction**: The generated view is lifted to a 3D object NeRF using single-view object reconstruction methods.
3. **3D Placement**: The 3D placement of the object is estimated using monocular depth estimation, ensuring the object is positioned correctly in the scene (see the placement sketch after this list).
4. **Scene and Object Fusion**: The object NeRF and the scene NeRF are fused into a new 3D scene containing the inserted object (see the fusion sketch below).
5. **Optional Refinement**: An optional refinement step can be applied to improve the appearance and details of the inserted object.

The method is evaluated on various 3D scenes, demonstrating its effectiveness in generating and inserting objects consistently across multiple viewpoints. InseRF can insert objects into complex scenes without requiring explicit 3D placement information, making it a powerful tool for 3D scene editing and manipulation.
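To make step 3 concrete, the sketch below shows one way to lift a 2D bounding box to a 3D placement using a monocular depth map and standard pinhole-camera geometry. This is a minimal illustration, not the authors' exact procedure; the function name `place_object_from_depth` and its interface are hypothetical, and a depth map from an off-the-shelf monocular estimator would in practice need to be aligned to the NeRF scene's scale.

```python
import numpy as np

def place_object_from_depth(bbox, depth, K, c2w):
    """Hypothetical sketch: lift a 2D bounding box to a 3D placement.

    bbox : (x_min, y_min, x_max, y_max) in pixels in the reference view.
    depth: (H, W) monocular depth map, assumed aligned to the scene's scale.
    K    : (3, 3) camera intrinsics of the reference view.
    c2w  : (4, 4) camera-to-world pose of the reference view.
    Returns the object's 3D center in world coordinates and a metric scale.
    """
    # Center of the 2D bounding box, in pixels.
    u = 0.5 * (bbox[0] + bbox[2])
    v = 0.5 * (bbox[1] + bbox[3])
    # Read the estimated depth at the box center.
    z = float(depth[int(v), int(u)])
    # Back-project the pixel through the pinhole model: X_cam = z * K^-1 [u, v, 1].
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    cam_pt = ray * z
    # Transform from the camera frame to world coordinates.
    world = c2w @ np.append(cam_pt, 1.0)
    # Choose a scale so the object's projection matches the 2D box width:
    # an object of width w at depth z spans roughly f_x * w / z pixels.
    scale = z * (bbox[2] - bbox[0]) / K[0, 0]
    return world[:3], scale
```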
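For step 4, a common way to render the union of two radiance fields is to query both at each sample point, add their densities, and blend colors by density weight. The sketch below illustrates that idea under stated assumptions: `scene_nerf` and `object_nerf` are callables returning per-point density and RGB, `obj_to_world` is a rigid 4x4 transform, and the compositing rule is a generic one; the paper's exact fusion may differ in detail.

```python
import torch

def homogenize(pts):
    """Append a 1 to each 3D point so 4x4 transforms can be applied."""
    return torch.cat([pts, torch.ones_like(pts[:, :1])], dim=-1)

def fused_field(scene_nerf, object_nerf, obj_to_world, pts, dirs):
    """Illustrative fusion of a scene NeRF and an object NeRF.

    pts : (N, 3) world-space sample points along camera rays.
    dirs: (N, 3) unit view directions for those samples.
    Densities add; colors are blended by relative density.
    """
    # Query the scene field directly in world coordinates.
    sigma_s, rgb_s = scene_nerf(pts, dirs)
    # Map samples (and view directions) into the object's canonical frame,
    # assuming obj_to_world is rigid so its 3x3 block is a rotation.
    world_to_obj = torch.inverse(obj_to_world)
    local_pts = (world_to_obj @ homogenize(pts).T).T[:, :3]
    local_dirs = dirs @ world_to_obj[:3, :3].T
    sigma_o, rgb_o = object_nerf(local_pts, local_dirs)
    # Additive density; color weighted by each field's contribution.
    sigma = sigma_s + sigma_o
    w = (sigma_o / sigma.clamp(min=1e-8)).unsqueeze(-1)
    rgb = (1.0 - w) * rgb_s + w * rgb_o
    return sigma, rgb
```

Rendering then proceeds with standard volume rendering over the fused `sigma` and `rgb`, so the inserted object occludes and is occluded by the scene consistently across viewpoints.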