10 Apr 2024 | Mi Yan, Jiazhao Zhang, Yan Zhu, He Wang
**MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation**
**Abstract:**
Open-vocabulary 3D instance segmentation aims to segment 3D objects without predefined categories, but it is hampered by the scarcity of annotated 3D data. Recent methods typically generate 2D masks and then merge them into 3D instances using local metrics computed between neighboring frames. In contrast, MaskClustering proposes a novel metric, the view consensus rate, to make better use of multi-view observations. The key insight is that two 2D masks should be considered part of the same 3D instance if a significant number of 2D masks from other views contain both of them. Using this metric as edge weights, a global mask graph is constructed in which each mask is a node. Iteratively clustering masks with high view consensus yields a set of clusters, each representing a distinct 3D instance. Notably, the method is training-free. Extensive experiments on ScanNet++, ScanNet200, and MatterPort3D demonstrate state-of-the-art performance in open-vocabulary 3D instance segmentation.
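To make the key insight concrete, here is a minimal sketch of one way a view consensus rate could be computed. It is not taken from the paper: the representation of masks as sets of 3D point indices, the function name `view_consensus_rate`, the `min_ratio` containment threshold, and the normalization by the number of frames that observe both masks are all assumptions made for illustration.

```python
from typing import AbstractSet, List, Sequence


def view_consensus_rate(
    mask_a: AbstractSet[int],
    mask_b: AbstractSet[int],
    frames: Sequence[List[AbstractSet[int]]],
    min_ratio: float = 0.8,  # hypothetical containment threshold
) -> float:
    """Fraction of observing frames in which a single 2D mask contains both masks.

    Each mask is a set of 3D point indices (an assumed representation).
    `frames` holds, for every other view, the list of masks found in that view.
    """
    observers = 0   # frames that see part of both masks
    supporters = 0  # observers where one mask covers both

    for frame_masks in frames:
        # Points visible in this frame: union of all its masks.
        visible = set().union(*frame_masks)
        seen_a = mask_a & visible
        seen_b = mask_b & visible
        if not seen_a or not seen_b:
            continue  # this frame does not observe both masks
        observers += 1

        for m in frame_masks:
            covers_a = len(seen_a & m) >= min_ratio * len(seen_a)
            covers_b = len(seen_b & m) >= min_ratio * len(seen_b)
            if covers_a and covers_b:
                supporters += 1
                break  # count each frame at most once

    return supporters / observers if observers else 0.0
```

With this definition, a pair of masks that is grouped together in most of the views that see both of them gets a rate close to 1, while a pair that only co-occurs in separate masks gets a rate close to 0.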
**Contributions:**
- A novel graph clustering methodology for merging 2D masks in 3D open-vocabulary instance segmentation.
- A novel view consensus metric for evaluating the relationship between 2D masks, leveraging global information from input image sequences.
- A state-of-the-art open-vocabulary 3D instance segmentation method, outperforming existing methods on various datasets.
**Method:**
1. **Problem Formulation:** Given color images, depths, and reconstructed point clouds, the algorithm outputs 3D instances with open-vocabulary semantics.
2. **Mask Graph Construction:** The view consensus rate is introduced to determine edge connectivity between masks, enhancing robustness against oversegmentation errors.
3. **Iterative Graph Clustering:** An iterative process clusters masks and updates the graph structure, prioritizing mask pairs with high view consensus.
4. **Open-Vocabulary Feature Aggregation:** Representative masks are selected and fused to create open-vocabulary features for each instance.
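The clustering step above can be sketched as a simple greedy agglomeration. This is an illustrative simplification rather than the authors' algorithm: the function name `cluster_masks`, the `rate_fn` callback (any pairwise score, such as a view consensus rate), the fixed `threshold`, and the choice to merge one best pair per iteration are all assumptions.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Cluster = FrozenSet[int]  # a cluster is a set of 3D point indices (assumed)


def cluster_masks(
    masks: Iterable[Iterable[int]],
    rate_fn: Callable[[Cluster, Cluster], float],
    threshold: float = 0.9,  # hypothetical consensus threshold
) -> List[Cluster]:
    """Greedily merge the pair with the highest consensus until none passes."""
    clusters: List[Cluster] = [frozenset(m) for m in masks]

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest pairwise score.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: rate_fn(clusters[ij[0]], clusters[ij[1]]),
        )
        if rate_fn(clusters[i], clusters[j]) < threshold:
            break  # no remaining pair has enough consensus

        # Merge the pair; scores involving the merged cluster are
        # recomputed on the next iteration (the graph "update").
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters
```

Merging the highest-scoring pair first mirrors the idea of prioritizing mask pairs with high view consensus. Recomputing every pairwise score on each pass is quadratic per iteration, which is acceptable for a sketch but would need caching on real scenes.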
**Experiments:**
- **Quantitative Comparison:** MaskClustering outperforms baselines on ScanNet++ and MatterPort3D, with significant gains in AP, including in the class-agnostic setting.
- **Ablation Studies:** Ablations of key components, such as under-segmented mask filtering and iterative clustering, confirm their effectiveness.
- **Qualitative Results:** The method demonstrates excellent performance in segmenting small objects and handling complex scenes.
**Limitations:**
- The method assumes near-perfect 2D segmentation and 2D-3D correspondence, which may not hold in real-world applications.
**Conclusion:**
MaskClustering achieves state-of-the-art performance in open-vocabulary 3D instance segmentation by leveraging view consensus in a global mask graph. Future work could explore applications in robotic tasks, such as open-vocabulary