GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane


27 Jul 2024 | Yansong Qu*, Shaohui Dai*, Xinyang Li, Jianghang Lin, Liujuan Cao†, Shengchuan Zhang, Rongrong Ji
This paper introduces GOI, a method for open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3DGS). Its key contribution is an Optimizable Semantic-space Hyperplane (OSH), which enables precise localization of target regions in response to natural-language prompts. GOI distills semantic features from 2D vision-language foundation models into 3DGS and identifies the 3D Gaussians of Interest using the OSH.

To reduce computational cost, a Trainable Feature Clustering Codebook (TFCC) compresses high-dimensional semantic features into compact low-dimensional vectors. By exploiting semantic redundancy within a scene, the codebook efficiently encodes diverse object features while still allowing precise identification of individual objects. The OSH is fine-tuned with an off-the-shelf Referring Expression Segmentation (RES) model to improve spatial perception for precise phrasal queries, dynamically adjusting the hyperplane so that irrelevant objects are excluded from the selected Gaussians.

Extensive experiments show that GOI outperforms existing state-of-the-art methods, achieving significant gains in mean Intersection over Union (mIoU) on the Mip-NeRF360 and Replica datasets. The method is implemented on top of 3D Gaussian Splatting and trains on a single 40 GB A100 GPU in roughly 10 minutes, making it practical for downstream applications such as localized scene editing and manipulation.
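To make the codebook idea concrete, here is a minimal sketch of feature clustering in the spirit of the TFCC. All names, dimensions, and the nearest-neighbor assignment rule are assumptions for illustration; the paper's codebook is trainable, whereas this sketch only shows the assignment and reconstruction steps with fixed entries.

```python
import numpy as np

# Assumed setup: high-dimensional semantic features (e.g. from a 2D
# vision-language model) and a small codebook of K entries. Each Gaussian
# then stores only a compact index instead of the full feature vector.
rng = np.random.default_rng(0)
D, K, N = 512, 8, 100          # feature dim, codebook size, number of features

codebook = rng.normal(size=(K, D))   # codebook entries (random init here)
features = rng.normal(size=(N, D))   # per-Gaussian semantic features

# Assign each feature to its nearest codebook entry (Euclidean distance).
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
indices = dists.argmin(axis=1)       # compact per-Gaussian code in 0..K-1

# Reconstruct (quantize) features from the codebook for downstream queries.
quantized = codebook[indices]
print(indices.shape, quantized.shape)
```

In a trainable version, the codebook entries would be updated by gradient descent so that quantized features reproduce the 2D semantic maps, which is what lets the scene store K low-dimensional codes instead of N full feature vectors.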
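The hyperplane selection step can likewise be sketched briefly. A semantic-space hyperplane w·f + b > 0 acts as a binary classifier over per-Gaussian features; in GOI it is initialized from the text-query embedding and then optimized. The names, dimensions, and bias value below are illustrative assumptions, and only the selection step is shown, not the fine-tuning.

```python
import numpy as np

# Assumed setup: unit-normalized per-Gaussian semantic features and a
# text-query embedding from the same vision-language feature space.
rng = np.random.default_rng(1)
D, N = 512, 1000

features = rng.normal(size=(N, D))
features /= np.linalg.norm(features, axis=1, keepdims=True)

text_embedding = rng.normal(size=D)
text_embedding /= np.linalg.norm(text_embedding)

# Hyperplane normal initialized from the query; the bias sets the decision
# threshold. In GOI both would be further optimized (e.g. against an RES
# model's mask); here they stay fixed.
w = text_embedding
b = -0.2

mask = features @ w + b > 0          # boolean selection of Gaussians of Interest
selected = np.nonzero(mask)[0]
print(selected.size, "of", N, "Gaussians selected")
```

Treating the threshold as a full hyperplane (rather than a fixed cosine-similarity cutoff) is what makes the decision boundary optimizable: both its orientation and offset can move to match the queried region.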