10 Apr 2024 | Mi Yan, Jiazhou Zhang, Yan Zhu, He Wang
MaskClustering is a view consensus-based mask graph clustering method for open-vocabulary 3D instance segmentation. It targets two challenges of the task: producing detailed segmentation for objects of varying scales, and letting users query those objects with open-vocabulary text. The method builds a global mask graph in which each node is a 2D mask and each edge is determined by the view consensus rate, defined as the proportion of frames that support merging two masks. Through iterative clustering and updating, masks with high view consensus are merged until a final list of clusters remains; each cluster contains multiple masks and denotes one distinct 3D instance. The point clouds and semantic features of the individual 2D masks are then aggregated into an open-vocabulary feature for each 3D instance. The method is training-free and achieves state-of-the-art performance on the publicly available ScanNet++, ScanNet200, and MatterPort3D datasets. The key contributions are a novel graph clustering methodology for merging 2D masks, a novel view consensus metric for evaluating relationships between 2D masks, and a state-of-the-art open-vocabulary 3D instance segmentation method. In comparisons with other state-of-the-art methods, it performs better in zero-shot mask prediction and open-vocabulary instance understanding, especially when segmenting fine-grained objects, with significant improvements across metrics on the ScanNet++ and MatterPort3D benchmarks. It is also robust to oversegmentation errors and outperforms approaches that rely solely on local geometric overlap.
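
The view consensus rate and the iterative merging described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that, for every 2D mask, two things have already been computed: the set of frames in which the mask is visible, and the set of 2D masks (identified by hypothetical `(frame_id, mask_id)` pairs) that contain it. A frame is counted as supporting a merge when one of its masks contains both candidates. The names `consensus_rate` and `cluster_masks`, the threshold of 0.9, and the choice to merge clusters by taking set unions are all assumptions made for illustration.

```python
from itertools import combinations


def consensus_rate(visible_a, visible_b, containers_a, containers_b):
    """Fraction of co-observing frames that support merging two masks.

    visible_*    : set of frame ids in which the mask (or cluster) is visible.
    containers_* : set of (frame_id, mask_id) pairs naming 2D masks that
                   contain it; the containment test is assumed precomputed.
    """
    observers = visible_a & visible_b
    if not observers:
        return 0.0
    # A frame supports the merge if a single 2D mask in it contains both.
    supporters = {frame for frame, _ in containers_a & containers_b} & observers
    return len(supporters) / len(observers)


def cluster_masks(visible, containers, rate_threshold=0.9):
    """Greedily merge clusters whose consensus rate reaches the threshold.

    rate_threshold is a placeholder value chosen for this sketch.
    """
    clusters = [
        {"masks": {m}, "visible": set(visible[m]), "containers": set(containers[m])}
        for m in sorted(visible)
    ]
    merged = True
    while merged:  # repeat until a full pass finds nothing to merge
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            rate = consensus_rate(
                a["visible"], b["visible"], a["containers"], b["containers"]
            )
            if rate >= rate_threshold:
                # Update step (an assumption): the merged cluster keeps the
                # union of its members' frames and containing masks.
                for key in ("masks", "visible", "containers"):
                    a[key] |= b[key]
                del clusters[j]
                merged = True
                break
    return [sorted(c["masks"]) for c in clusters]


# Toy input: masks "a" and "b" are contained by the same mask in every
# frame, while "c" is only contained by itself.
visible = {"a": {0, 1, 2}, "b": {0, 1, 2}, "c": {0, 1, 2}}
shared = {(0, "a"), (1, "b"), (2, "x")}
containers = {"a": set(shared), "b": set(shared), "c": {(2, "c")}}
print(cluster_masks(visible, containers))  # [['a', 'b'], ['c']]
```

In the toy input, every frame that observes both "a" and "b" also has a mask containing both, so their rate is 1.0 and they merge; "c" shares no containing mask with the merged cluster and stays separate. Because the decision depends on agreement across all observing frames rather than on pairwise overlap, a single oversegmented frame lowers the rate without vetoing the merge outright.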
The method is also efficient in computation and can handle large-scale data. The method is applicable to a wide range of applications, including robotics and VR/AR.
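
The feature aggregation and open-vocabulary text query mentioned in the summary can be sketched in the same spirit. The summary does not say how per-mask features are combined or which encoder produces them, so everything here is an assumption: each 2D mask is taken to carry an embedding from some vision-language encoder, the instance feature is the normalized mean of its masks' unit-length embeddings, and a text query is answered by cosine similarity. The helper names `normalize`, `instance_feature`, and `query` are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def instance_feature(mask_features):
    """Average the unit-length features of one cluster's 2D masks.

    Mean pooling is an assumption of this sketch, not a detail taken
    from the summary.
    """
    unit = [normalize(f) for f in mask_features]
    dim = len(unit[0])
    mean = [sum(f[k] for f in unit) / len(unit) for k in range(dim)]
    return normalize(mean)


def query(instance_features, text_feature):
    """Return the best-matching instance name and all cosine scores."""
    text = normalize(text_feature)
    scores = {
        name: sum(a * b for a, b in zip(feature, text))
        for name, feature in instance_features.items()
    }
    return max(scores, key=scores.get), scores


# Toy 2D embeddings standing in for real encoder outputs.
instances = {
    "inst0": instance_feature([[1.0, 0.0], [0.8, 0.2]]),
    "inst1": instance_feature([[0.0, 1.0], [0.1, 0.9]]),
}
best, scores = query(instances, [1.0, 0.1])
print(best)
```

Normalizing each mask feature before averaging keeps a single high-magnitude embedding from dominating the instance feature; other pooling schemes, such as weighting by mask size, would fit the same interface.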