22 Apr 2024 | Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, Kanglin Liu
CLIP-GS is a method that integrates CLIP semantics into 3D Gaussian Splatting (GS) to achieve real-time, view-consistent 3D semantic understanding without annotated data. It addresses the limitations of existing methods with two key techniques: Semantic Attribute Compactness (SAC) and 3D Coherent Self-training (3DCS). SAC reduces the dimensionality of semantic features by exploiting the unified semantics within objects, enabling efficient rendering (>100 FPS). 3DCS enforces semantic consistency across views using self-predicted pseudo-labels derived from the 3D Gaussian model, improving segmentation accuracy. Extensive experiments show that CLIP-GS outperforms state-of-the-art methods on the Replica and ScanNet datasets, achieving significant mIoU improvements, and it remains robust even with sparse input views. By representing 3D scene semantics efficiently with 3D Gaussians, the method delivers high-quality rendering and accurate, view-consistent semantic understanding suitable for real-time applications.
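To make the SAC idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: each Gaussian carries a compact low-dimensional semantic code, and a decoder maps decoded codes back to CLIP space for open-vocabulary text queries. The dimensions (`compact_dim = 8`, `clip_dim = 512`), the linear decoder, and all function names are illustrative assumptions; CLIP-GS learns its compact attributes jointly with the Gaussians, and the small per-Gaussian feature size is what keeps semantic rendering above 100 FPS.

```python
import numpy as np

rng = np.random.default_rng(0)

num_gaussians = 10_000
clip_dim = 512     # dimensionality of CLIP image/text embeddings
compact_dim = 8    # illustrative low-dimensional code per Gaussian

# Compact per-Gaussian semantic attributes (learned in the real method;
# random here purely for illustration).
codes = rng.normal(size=(num_gaussians, compact_dim)).astype(np.float32)

# A linear decoder standing in for the learned mapping from compact
# codes back to CLIP feature space.
decoder = rng.normal(size=(compact_dim, clip_dim)).astype(np.float32)

def query_similarity(codes, decoder, text_embedding):
    """Cosine similarity between decoded Gaussian semantics and a text query."""
    feats = codes @ decoder                                  # (N, clip_dim)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    return feats @ text                                      # (N,)

# A stand-in for a CLIP text embedding of an open-vocabulary query.
text_embedding = rng.normal(size=clip_dim).astype(np.float32)
scores = query_similarity(codes, decoder, text_embedding)
print(scores.shape)  # (10000,)
```

The point of the sketch is the memory arithmetic: rasterizing an 8-dimensional feature per pixel instead of a full 512-dimensional CLIP feature cuts the semantic rendering cost by roughly 64x, which is what makes real-time rates plausible.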