2024 | Zhipeng Qian, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun
X-RefSeg3D is a model for referring 3D instance segmentation that integrates structured cross-modal graph neural networks. It constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. The model first captures object-specific text features and fuses them with instance features to build a comprehensive cross-modal scene graph. The resulting cross-modal features are then processed by graph neural networks, which use a K-nearest-neighbor algorithm to combine explicit instructions from the expression with factual relationships in the scene. This captures higher-order relationships among instances, strengthening feature fusion and reasoning. Finally, a matching module computes the final matching score from the refined features. Experiments on ScanRefer demonstrate the effectiveness of the method, which surpasses previous approaches by a substantial margin of +3.67% mIoU. The code and models are available at https://github.com/qzp2018/X-RefSeg3D. The model introduces two key modules: Entity-Aware Fusion (EAF) and Relation-Driven Interaction (RDI). The EAF module selectively extracts textual features that describe entities and integrates them into the instance features to form the cross-modal scene graph. The RDI module targets high-order semantic comprehension, combining explicit instructions from expressions with factual relationships in scenes to yield a comprehensive representation enriched with both relations and attributes. Together, EAF and RDI give X-RefSeg3D its +3.67% mIoU improvement over the previous state-of-the-art method.
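The graph-construction step above can be sketched in plain NumPy. This is a minimal illustration, assuming each instance is represented by a 3D centroid: edges connect every instance to its K nearest neighbors, and a coarse edge type is derived from the dominant axis of the relative position vector. The function names and the edge-typing scheme are illustrative stand-ins, not the paper's exact design.

```python
import numpy as np

def build_knn_graph(centroids, k=2):
    """Build a K-nearest-neighbor graph over instance centroids.

    Returns a list of (i, j, rel) edges, where rel is the relative
    position vector from instance i to its neighbor j.
    """
    n = len(centroids)
    edges = []
    for i in range(n):
        # Euclidean distances from instance i to all instances.
        dists = np.linalg.norm(centroids - centroids[i], axis=1)
        dists[i] = np.inf  # exclude the self-loop
        for j in np.argsort(dists)[:k]:
            rel = centroids[j] - centroids[i]
            edges.append((i, int(j), rel))
    return edges

def edge_type(rel):
    """Assign a coarse edge type from the dominant axis and sign of the
    relative position vector (an assumed simplification of the paper's
    position-based edge typing)."""
    axis = int(np.argmax(np.abs(rel)))
    sign = "+" if rel[axis] >= 0 else "-"
    return ("x", "y", "z")[axis] + sign

# Example: four instances; each gets edges to its two nearest neighbors.
centroids = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 3.0]])
edges = build_knn_graph(centroids, k=2)
```

In a full model, the node features attached to this graph would be the EAF-fused cross-modal instance features rather than raw centroids; only the connectivity and edge typing are shown here.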
The EAF module achieves this by integrating entity information from the expression with the corresponding objects, using the fused cross-modal features and the spatial positions of instances to construct the cross-modal scene graph. Relative spatial relations among instances are modeled explicitly: edge types in the graph are determined from relative position vectors. The RDI module then applies a K-nearest-neighbor algorithm for local perception, aggregating positional information from both the expression and the scene; expressions describing relative position are combined with the factual spatial relationships in the scene to identify the target instance. The matching module predicts the match between each instance and the corresponding expression using two complementary objectives, and the final score combines the two predicted scores. The loss function is a linear combination of a cosine loss and a prediction loss. Experiments on the ScanRefer dataset demonstrate the method's effectiveness, yielding significant performance gains with both GRU and BERT text encoders. Ablation studies show that the weighted edge gate helps incorporate details of relative positional relationships within expressions, and that performance decreases gradually as the number of RDI inference layers increases. The model's K-ne
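The two-score matching scheme and the combined loss can be sketched as below, again in plain NumPy. The linear prediction head with a sigmoid, the equal-weight averaging of the two scores, and the binary-cross-entropy form of the prediction loss are all assumptions for illustration; the source states only that two complementary objectives are combined and that the loss is a linear combination of a cosine loss and a prediction loss.

```python
import numpy as np

def cosine_score(inst_feat, text_feat):
    """Cosine similarity between one instance feature and the
    sentence-level expression feature."""
    den = np.linalg.norm(inst_feat) * np.linalg.norm(text_feat) + 1e-8
    return float(inst_feat @ text_feat) / den

def matching_scores(inst_feats, text_feat, W, b):
    """Per-instance matching: a cosine score plus a learned prediction
    score (linear head + sigmoid, an assumed form of the second
    objective), averaged into the final score."""
    cos = np.array([cosine_score(f, text_feat) for f in inst_feats])
    pred = 1.0 / (1.0 + np.exp(-(inst_feats @ W + b)))  # sigmoid head
    return 0.5 * (cos + pred), cos, pred

def total_loss(cos_scores, pred_scores, target_idx, lam=1.0):
    """Linear combination of a cosine loss and a prediction loss, as
    stated in the text; the exact loss forms here are assumptions."""
    cos_loss = 1.0 - cos_scores[target_idx]  # pull target cosine toward 1
    labels = np.zeros_like(pred_scores)
    labels[target_idx] = 1.0
    eps = 1e-8
    # Binary cross-entropy over all instances against the target label.
    pred_loss = -np.mean(labels * np.log(pred_scores + eps)
                         + (1 - labels) * np.log(1 - pred_scores + eps))
    return lam * cos_loss + pred_loss

# Example: two instances, a 2-D feature space, an untrained head.
inst = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
final, cos, pred = matching_scores(inst, text, np.zeros(2), 0.0)
```

The instance whose final score is highest would be selected as the referent; during training, `total_loss` pushes the target's cosine score toward 1 while the prediction head is supervised against the target label.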