Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection


2 March 2024 | Xihang Hu, Fuming Sun, Jing Sun, Fasheng Wang, Haojie Li
This paper proposes a cross-modal fusion and progressive decoding network (CPNet) for RGB-D salient object detection (SOD). Existing methods often attach auxiliary modules such as feature enhancement and edge generation, which can introduce feature redundancy and degrade performance. CPNet instead retains only three essential components: feature encoding, feature fusion, and feature decoding. In feature encoding, a two-stream Swin Transformer extracts multi-level, multi-scale features from the RGB and depth images and models global context. In feature fusion, a cross-modal attention fusion module combines multi-modality, multi-level features via attention mechanisms. In feature decoding, a progressive decoder gradually integrates low-level features while filtering noise, so that salient objects are predicted accurately. Experiments on six benchmarks show that CPNet outperforms 12 state-of-the-art methods on four metrics. It is further verified that, under this framework, adding feature enhancement or edge generation modules is not beneficial for RGB-D SOD, offering a new perspective on SOD. The code is available at https://github.com/hu-xh/CPNet.

Keywords: Salient object detection · Cross-modality · Multi-scale feature aggregation · Attention mechanism

SOD aims to detect the most visually attractive regions of an image and segment them accurately. Recent CNN-based methods have achieved great success in SOD but still struggle in complex scenes. With the growing availability of depth cameras, RGB-D SOD has become an attractive research direction; however, integrating the complementary information in RGB and depth images remains a key issue. Some studies directly concatenate the depth map with the RGB image into a four-channel input, while others treat depth features as auxiliary information. Feature extraction can also lose fine detail, producing blurred boundaries in the saliency maps, and multi-scale feature aggregation is essential for precise salient object localization. Existing algorithms rely on attention mechanisms or ASPP to extract multi-scale information, but these often require additional modules, bringing feature redundancy and extra computational cost. CPNet addresses these issues with a cross-modal attention fusion module and a progressive decoder, achieving state-of-the-art performance in RGB-D SOD.
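To make the encoding stage concrete, below is a minimal PyTorch sketch of a two-stream Swin Transformer encoder. It assumes the backbone comes from the `timm` library; the class name and arguments are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import timm


class TwoStreamSwinEncoder(nn.Module):
    """Two independent Swin backbones, one per modality (hypothetical sketch)."""

    def __init__(self, backbone: str = "swin_base_patch4_window7_224"):
        super().__init__()
        # features_only=True returns a pyramid of multi-level feature maps.
        self.rgb_encoder = timm.create_model(backbone, pretrained=True, features_only=True)
        self.depth_encoder = timm.create_model(backbone, pretrained=True, features_only=True)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Replicate a single-channel depth map to 3 channels so both
        # streams can share the same patch-embedding stem.
        if depth.shape[1] == 1:
            depth = depth.repeat(1, 3, 1, 1)
        rgb_feats = self.rgb_encoder(rgb)        # list of features, shallow -> deep
        depth_feats = self.depth_encoder(depth)
        # Note: depending on the timm version, Swin feature maps may come out
        # channels-last (NHWC) and need a permute before 2D convolutions.
        return rgb_feats, depth_feats
```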
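The cross-modal attention fusion step can be sketched as follows. This is an illustrative design consistent with the paper's description (attention-based fusion of multi-modality features), not the exact module: here the depth features provide spatial attention for the RGB stream, while RGB global context re-weights the depth channels.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Hypothetical single-level fusion block for RGB and depth features."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention derived from RGB global context ("what").
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention derived from the depth stream ("where").
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        sa = self.spatial_att(f_depth)   # (B, 1, H, W): depth guides where to look
        ca = self.channel_att(f_rgb)     # (B, C, 1, 1): RGB guides channel weighting
        return self.fuse(f_rgb * sa + f_depth * ca)
```

One such block per encoder level would yield the fused multi-level features that the decoder consumes.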
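Finally, a hedged sketch of a progressive decoder in the spirit of the paper: starting from the deepest fused feature, each step upsamples the running decoder state, uses it to gate the next shallower (noisier) feature, and merges the two, so low-level detail is integrated gradually while noise is suppressed. The channel widths below are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveDecoder(nn.Module):
    """Illustrative deep-to-shallow decoder with gated low-level fusion."""

    def __init__(self, channels=(128, 256, 512, 1024), mid=64):
        super().__init__()
        # Project every fused level to a common width.
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in channels])
        self.merge = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1),
                          nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
            for _ in channels[:-1]
        ])
        self.head = nn.Conv2d(mid, 1, 1)  # saliency prediction

    def forward(self, feats):
        # feats: list of fused features, ordered shallow -> deep.
        feats = [s(f) for s, f in zip(self.squeeze, feats)]
        x = feats[-1]
        for f, merge in zip(reversed(feats[:-1]), reversed(self.merge)):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            gate = torch.sigmoid(x)        # deep semantics filter shallow noise
            x = merge(f * gate + x)
        return self.head(x)
```

A full-resolution saliency map would then follow from one more bilinear upsampling of the head's output to the input image size.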