N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

28 Jul 2024 | Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi
**Authors:** Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi

**Affiliation:** Visual Geometry Group, University of Oxford

**Abstract:** Understanding complex scenes at multiple levels of abstraction remains a significant challenge in computer vision. To address this, the authors introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field. This feature field encodes scene properties at varying granularities within a high-dimensional space, allowing hierarchies to be defined flexibly along physical or semantic dimensions. N2F2 leverages a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings and uses the CLIP vision encoder to obtain language-aligned embeddings for each segment. The hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings, using deferred volumetric rendering at varying physical scales to create a coarse-to-fine representation. Extensive experiments demonstrate that N2F2 outperforms state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, particularly for complex compound queries.

**Keywords:**
- Hierarchical Scene Understanding
- Feature Field Distillation
- Open-Vocabulary 3D Segmentation

**Introduction:** The paper discusses the challenges of 3D scene understanding, emphasizing its hierarchical nature: scenes must be reasoned about at varying levels of geometric and semantic detail. Recent progress in radiance fields, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, has advanced 3D scene understanding by recovering shape and appearance from multiple RGB images.
However, existing methods often struggle with complex linguistic constructs and are inefficient during inference. N2F2 addresses these limitations by imposing a hierarchical structure on the 3D feature field during training, enabling the model to capture multi-scale scene representations coherently and efficiently.

**Method:** N2F2 uses a 3D Gaussian Splatting (3DGS) model to represent the scene and augments it with a feature field that captures scene properties across different scales and semantic granularities. The feature field is optimized with a scale-aware hierarchical supervision method in which different nested subsets of the feature dimensions encode varying levels of detail. A composite embedding approach is also introduced to enable efficient open-vocabulary querying without explicit scale selection.

**Experiments:** The authors evaluate N2F2 on challenging benchmarks, including the expanded LERF dataset and the 3D-OVS dataset. Results show that N2F2 significantly outperforms state-of-the-art methods in localization accuracy and 3D semantic segmentation, especially for complex compound queries.

**Conclusion:** N2F2 introduces a novel approach to hierarchical scene understanding by learning a single nested feature field that encodes scene properties at multiple granularities, enabling efficient open-vocabulary segmentation and localization in 3D scenes.
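The nested design described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the feature and prefix dimensions, the random per-scale decoders, and the function names (`scale_relevance`, `composite_relevance`, `distill_loss`) are all assumptions chosen for the example. In N2F2 the decoders would be learned jointly with the feature field, and features would come from deferred volumetric rendering rather than a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64-dim nested feature field, CLIP embeddings of dim 512.
FEAT_DIM, CLIP_DIM = 64, 512
SCALE_DIMS = [8, 16, 32, 64]  # nested prefixes of the feature vector, coarse to fine

# Stand-in per-scale decoders mapping a nested prefix into CLIP space
# (learned in the actual method; random matrices here for illustration).
decoders = [rng.standard_normal((d, CLIP_DIM)) for d in SCALE_DIMS]

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def scale_relevance(feature, text_emb, scale_idx):
    """Cosine similarity between the decoded prefix at one scale and a query."""
    d = SCALE_DIMS[scale_idx]
    decoded = normalize(feature[:d] @ decoders[scale_idx])
    return float(decoded @ normalize(text_emb))

def composite_relevance(feature, text_emb):
    """Composite querying: take the best-matching scale, so no explicit
    scale selection is needed at inference time."""
    return max(scale_relevance(feature, text_emb, s)
               for s in range(len(SCALE_DIMS)))

def distill_loss(feature, clip_emb, scale_idx):
    """Per-scale distillation target: cosine distance between the decoded
    prefix and the CLIP embedding of the segment seen at that scale."""
    return 1.0 - scale_relevance(feature, clip_emb, scale_idx)

feature = rng.standard_normal(FEAT_DIM)  # one rendered per-pixel feature
query = rng.standard_normal(CLIP_DIM)    # CLIP text embedding of a phrase
print(composite_relevance(feature, query))
```

Because each scale only reads a prefix of the same vector, coarse scales are forced to compress what finer scales refine, which is what gives the single field its coarse-to-fine structure.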