GOI (3D Gaussians of Interest) is an innovative framework for 3D open-vocabulary scene understanding, integrating semantic features from 2D vision-language models into 3D Gaussian Splatting (3DGS). The key contributions of GOI include:
1. **Trainable Feature Clustering Codebook (TFCC)**: This method compresses high-dimensional semantic features into low-dimensional vectors, reducing storage and computational costs while maintaining clear semantic boundaries (see the sketch after this list).
2. **Optimizable Semantic-space Hyperplane (OSH)**: OSH fine-tunes a hyperplane for each query, enhancing the accuracy of open-vocabulary querying by precisely locating target regions based on natural language prompts.
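To make the codebook idea concrete, here is a minimal sketch of a trainable feature-clustering codebook in PyTorch. The class name `FeatureCodebook`, the entry count, and the cosine-similarity assignment are illustrative assumptions, not GOI's actual implementation; the point is only that each high-dimensional VLM feature is replaced by a compact index into a small set of learned prototypes.

```python
# Minimal sketch of a trainable feature-clustering codebook (illustrative only).
# The class and parameter names below are assumptions, not GOI's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCodebook(nn.Module):
    def __init__(self, num_entries: int = 256, feat_dim: int = 512):
        super().__init__()
        # Each codebook entry is a learnable prototype in the 2D VLM's semantic space.
        self.entries = nn.Parameter(torch.randn(num_entries, feat_dim))

    def forward(self, features: torch.Tensor):
        """Map high-dimensional features (N, feat_dim) to codebook indices
        and their quantized reconstructions."""
        feats = F.normalize(features, dim=-1)
        codes = F.normalize(self.entries, dim=-1)
        sim = feats @ codes.t()            # (N, num_entries) cosine similarities
        indices = sim.argmax(dim=-1)       # compact representation: one index per feature
        quantized = self.entries[indices]  # reconstructed high-dimensional feature
        return indices, quantized

# Usage: cluster per-pixel VLM features, then store only the compact indices in 3D.
codebook = FeatureCodebook()
pixel_features = torch.randn(4096, 512)    # e.g. per-pixel features from the 2D encoder
ids, recon = codebook(pixel_features)
```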
The approach involves several steps:
- **Pixel-level Semantic Feature Extraction**: Utilizing the APE model to extract pixel-aligned semantic features from multi-view images.
- **TFCC**: Compressing high-dimensional semantic features into a codebook to reduce redundancy and improve efficiency.
- **3D Gaussian Semantic Fields**: Reconstructing 3D Gaussian semantic fields using low-dimensional semantic features.
- **OSH Optimization**: Fine-tuning the hyperplane with supervision from a Referring Expression Segmentation (RES) model to improve spatial perception and handle precise phrasal queries.
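The sketch below illustrates, under stated assumptions, how an optimizable semantic-space hyperplane query could work: the hyperplane normal is initialized from the text embedding of the prompt and optionally fine-tuned against a 2D mask (e.g. from an RES model). The function name `query_with_hyperplane`, the plain BCE loss, and the sign-based decision rule are illustrative choices, not GOI's exact formulation.

```python
# Minimal sketch of querying rendered semantic features with an optimizable
# semantic-space hyperplane (illustrative assumptions, not GOI's exact method).
import torch
import torch.nn.functional as F

def query_with_hyperplane(rendered_feats, text_embedding, ref_mask=None, steps=50, lr=1e-2):
    """rendered_feats: (H*W, D) semantic features rendered from the 3D Gaussian field.
    text_embedding:  (D,) embedding of the natural-language prompt.
    ref_mask:        optional (H*W,) binary mask, e.g. from an RES model, used to
                     fine-tune the hyperplane; if None, the initial hyperplane is used.
    """
    # Initialize the hyperplane normal from the text embedding, bias at zero.
    w = F.normalize(text_embedding.detach().clone(), dim=0).requires_grad_(True)
    b = torch.zeros(1, requires_grad=True)

    if ref_mask is not None:
        optim = torch.optim.Adam([w, b], lr=lr)
        for _ in range(steps):
            logits = rendered_feats @ w + b
            loss = F.binary_cross_entropy_with_logits(logits, ref_mask.float())
            optim.zero_grad()
            loss.backward()
            optim.step()

    # Pixels (and, by extension, Gaussians) on the positive side match the query.
    return (rendered_feats @ w + b) > 0
```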
Experiments on the Mip-NeRF360 and Replica datasets demonstrate that GOI outperforms existing methods, achieving significant improvements in mean Intersection over Union (mIoU) and other metrics. The method is also applied to downstream tasks such as scene manipulation and editing, showcasing its practical utility.