Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection


2 March 2024 | Xihang Hu, Fuming Sun, Jing Sun, Fasheng Wang, Haojie Li
This paper proposes a cross-modal fusion and progressive decoding network (CPNet) for RGB-D salient object detection (SOD). Existing methods often attach auxiliary modules such as feature enhancement and edge generation, which can introduce feature redundancy and degrade performance. CPNet instead retains only three essential components: feature encoding, feature fusion, and feature decoding. In feature encoding, a two-stream Swin Transformer extracts multi-level, multi-scale features from the RGB and depth images and models global context. In feature fusion, a cross-modal attention fusion module combines multi-modality, multi-level features via attention mechanisms. In feature decoding, a progressive decoder gradually integrates low-level features while filtering noise, so that salient objects are predicted accurately. Experiments on six benchmarks show that CPNet outperforms 12 state-of-the-art methods on four metrics. It is further verified that, under this framework, adding feature enhancement or edge generation modules is not beneficial for RGB-D SOD, offering a new perspective on SOD. The code is available at https://github.com/hu-xh/CPNet.

Keywords: Salient object detection · Cross-modality · Multi-scale feature aggregation · Attention mechanism

SOD aims to detect the most visually attractive regions of an image and segment them accurately. Recent CNN-based methods have achieved great success in SOD but still struggle in complex scenes. With the growing availability of depth cameras, RGB-D SOD has become an attractive research direction; however, integrating the complementary information in RGB and depth images remains a key issue. Some studies directly concatenate the depth map with the RGB image into a four-channel input, while others treat depth features as auxiliary information. Feature extraction can also lose fine detail, producing blurred boundaries in the saliency maps, and multi-scale feature aggregation is essential for precise salient object localization. Existing algorithms rely on attention mechanisms or ASPP to extract multi-scale information, but these often require additional modules, bringing feature redundancy and extra computational cost. CPNet addresses these issues with a cross-modal attention fusion module and a progressive decoder, achieving state-of-the-art performance in RGB-D SOD.
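To make the encoding stage concrete, below is a minimal PyTorch sketch of a two-stream Swin Transformer encoder. It assumes the backbone comes from the `timm` library; the class name and arguments are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import timm


class TwoStreamSwinEncoder(nn.Module):
    """Two independent Swin backbones, one per modality (hypothetical sketch)."""

    def __init__(self, backbone: str = "swin_base_patch4_window7_224"):
        super().__init__()
        # features_only=True returns a pyramid of multi-level feature maps.
        self.rgb_encoder = timm.create_model(backbone, pretrained=True, features_only=True)
        self.depth_encoder = timm.create_model(backbone, pretrained=True, features_only=True)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Replicate a single-channel depth map to 3 channels so both
        # streams can share the same patch-embedding stem.
        if depth.shape[1] == 1:
            depth = depth.repeat(1, 3, 1, 1)
        rgb_feats = self.rgb_encoder(rgb)        # list of features, shallow -> deep
        depth_feats = self.depth_encoder(depth)
        # Note: depending on the timm version, Swin feature maps may come out
        # channels-last (NHWC) and need a permute before 2D convolutions.
        return rgb_feats, depth_feats
```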
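The cross-modal attention fusion step can be sketched as follows. This is an illustrative design consistent with the paper's description (attention-based fusion of multi-modality features), not the exact module: here the depth features provide spatial attention for the RGB stream, while RGB global context re-weights the depth channels.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Hypothetical single-level fusion block for RGB and depth features."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention derived from RGB global context ("what").
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention derived from the depth stream ("where").
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        sa = self.spatial_att(f_depth)   # (B, 1, H, W): depth guides where to look
        ca = self.channel_att(f_rgb)     # (B, C, 1, 1): RGB guides channel weighting
        return self.fuse(f_rgb * sa + f_depth * ca)
```

One such block per encoder level would yield the fused multi-level features that the decoder consumes.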
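Finally, a hedged sketch of a progressive decoder in the spirit of the paper: starting from the deepest fused feature, each step upsamples the running decoder state, uses it to gate the next shallower (noisier) feature, and merges the two, so low-level detail is integrated gradually while noise is suppressed. The channel widths below are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveDecoder(nn.Module):
    """Illustrative deep-to-shallow decoder with gated low-level fusion."""

    def __init__(self, channels=(128, 256, 512, 1024), mid=64):
        super().__init__()
        # Project every fused level to a common width.
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in channels])
        self.merge = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1),
                          nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
            for _ in channels[:-1]
        ])
        self.head = nn.Conv2d(mid, 1, 1)  # saliency prediction

    def forward(self, feats):
        # feats: list of fused features, ordered shallow -> deep.
        feats = [s(f) for s, f in zip(self.squeeze, feats)]
        x = feats[-1]
        for f, merge in zip(reversed(feats[:-1]), reversed(self.merge)):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            gate = torch.sigmoid(x)        # deep semantics filter shallow noise
            x = merge(f * gate + x)
        return self.head(x)
```

A full-resolution saliency map would then follow from one more bilinear upsampling of the head's output to the input image size.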