Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

23 Aug 2024 | Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li
Semantic Gaussians is a novel approach for open-vocabulary 3D scene understanding using 3D Gaussian Splatting. The method distills knowledge from 2D pre-trained models into 3D Gaussians, enabling semantic component assignment to each Gaussian point. A versatile projection framework maps 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationships and requires no additional training. A 3D semantic network directly predicts the semantic component from raw 3D Gaussians for fast inference. Quantitative results on ScanNet segmentation and LERF object localization demonstrate the method's superior performance. The approach also explores applications including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation, showing better qualitative results compared to 2D and 3D baselines. The method is flexible, leveraging arbitrary pre-trained 2D models like OpenSeg, CLIP, and VLPart to generate pixel-wise semantic features on 2D RGB images. The 3D semantic network uses MinkowskiNet to process 3D Gaussians and is supervised by the semantic components obtained from the projection method. The method achieves faster inference and better generalization compared to 2D projection. Experiments on ScanNet and LERF datasets show the effectiveness of the method on open-vocabulary scene understanding and various applications. The contributions include introducing Semantic Gaussians, proposing a versatile semantic feature projection framework, and introducing a 3D semantic network for direct prediction of semantic components from raw 3D Gaussians. The method is effective for 3D scene understanding and supports diverse downstream tasks.Semantic Gaussians is a novel approach for open-vocabulary 3D scene understanding using 3D Gaussian Splatting. The method distills knowledge from 2D pre-trained models into 3D Gaussians, enabling semantic component assignment to each Gaussian point. A versatile projection framework maps 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationships and requires no additional training. A 3D semantic network directly predicts the semantic component from raw 3D Gaussians for fast inference. Quantitative results on ScanNet segmentation and LERF object localization demonstrate the method's superior performance. The approach also explores applications including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation, showing better qualitative results compared to 2D and 3D baselines. The method is flexible, leveraging arbitrary pre-trained 2D models like OpenSeg, CLIP, and VLPart to generate pixel-wise semantic features on 2D RGB images. The 3D semantic network uses MinkowskiNet to process 3D Gaussians and is supervised by the semantic components obtained from the projection method. The method achieves faster inference and better generalization compared to 2D projection. Experiments on ScanNet and LERF datasets show the effectiveness of the method on open-vocabulary scene understanding and various applications. The contributions include introducing Semantic Gaussians, proposing a versatile semantic feature projection framework, and introducing a 3D semantic network for direct prediction of semantic components from raw 3D Gaussians. The method is effective for 3D scene understanding and supports diverse downstream tasks.
Reach us at info@study.space
[slides] Semantic Gaussians%3A Open-Vocabulary Scene Understanding with 3D Gaussian Splatting | StudySpace