2024 | Zhipeng Qian*, Yiwei Ma*, Jiayi Ji, Xiaoshuai Sun†
X-RefSeg3D is a novel model designed to enhance referring 3D instance segmentation by integrating textual and spatial relationships using structured cross-modal graph neural networks. The model addresses the limitations of previous methods, which often overlook the distinct roles of different words in referring expressions and fail to align the positional relationships described in expressions with the spatial correlations among objects in 3D scenes. X-RefSeg3D constructs a cross-modal graph for the input 3D scene, fusing object-specific text features with instance features to create a comprehensive scene graph. It then feeds these fused features into graph neural networks, leveraging the K-nearest-neighbor algorithm to build edges so that reasoning follows both the explicit relational cues in expressions and the factual spatial relationships in scenes. This approach enables the model to capture higher-order relationships among instances, enhancing feature fusion and facilitating reasoning. The refined features are then passed to a matching module that computes the final matching score. Experimental results on the ScanRefer dataset demonstrate the effectiveness of X-RefSeg3D, achieving a significant improvement of +3.67% in mIoU over previous state-of-the-art methods. The code and models are available at https://github.com/qzp2018/X-RefSeg3D.
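To make the pipeline described above concrete, here is a minimal PyTorch sketch of the high-level flow: fuse object-specific text features with instance features, connect instances via K-nearest-neighbor edges over their centroids, run a few rounds of message passing, and score each instance against a sentence-level query. All class names, layer choices (e.g., the GRU-based node update), and tensor shapes here are illustrative assumptions; the paper's actual fusion, graph construction, and matching modules differ in detail.

```python
import torch
import torch.nn as nn


def knn_edges(centers: torch.Tensor, k: int) -> torch.Tensor:
    """Connect each instance to its K nearest neighbors by centroid distance.

    centers: (N, 3) instance centroids. Returns a (2, N*k) edge index.
    """
    dist = torch.cdist(centers, centers)       # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))          # exclude self-loops
    knn = dist.topk(k, largest=False).indices  # (N, k) neighbor indices
    src = torch.arange(centers.size(0)).repeat_interleave(k)
    return torch.stack([src, knn.reshape(-1)], dim=0)


class CrossModalGraphLayer(nn.Module):
    """One round of message passing over the fused scene graph (hypothetical)."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from a (node, neighbor) pair
        self.upd = nn.GRUCell(dim, dim)     # node update from aggregated messages

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        src, dst = edges
        m = self.msg(torch.cat([x[dst], x[src]], dim=-1))  # (E, dim) messages
        agg = torch.zeros_like(x).index_add_(0, dst, m)    # sum messages per node
        return self.upd(agg, x)


class XRefSeg3DSketch(nn.Module):
    """Fuse per-instance text and instance features, reason over the KNN graph,
    then compute a matching score per instance against the whole expression."""

    def __init__(self, dim: int = 128, k: int = 8, rounds: int = 2):
        super().__init__()
        self.k = k
        self.fuse = nn.Linear(2 * dim, dim)
        self.layers = nn.ModuleList(CrossModalGraphLayer(dim) for _ in range(rounds))
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, inst_feat, text_feat, sent_feat, centers):
        # inst_feat: (N, dim) instance features; text_feat: (N, dim) object-specific
        # text features (e.g., from word-to-instance attention); sent_feat: (dim,)
        x = self.fuse(torch.cat([inst_feat, text_feat], dim=-1))
        edges = knn_edges(centers, self.k)
        for layer in self.layers:
            x = layer(x, edges)
        q = sent_feat.expand_as(x)
        return self.score(torch.cat([x, q], dim=-1)).squeeze(-1)  # (N,) scores


# Usage: pick the instance whose refined features best match the expression.
N, dim = 30, 128
model = XRefSeg3DSketch(dim)
scores = model(torch.randn(N, dim), torch.randn(N, dim),
               torch.randn(dim), torch.rand(N, 3))
pred = scores.argmax()  # index of the referred instance
```

The KNN edge construction mirrors the abstract's use of spatial proximity to define which instances exchange messages, and stacking multiple graph layers is what lets the model capture higher-order (multi-hop) relationships among instances.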