13 May 2024 | Liuxin Bao, Xiaofei Zhou*, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu*, Jiyong Zhang, and Chenggang Yan
The paper introduces a Quality-aware Selective Fusion Network (QSF-Net) for visible-depth-thermal (VDT) salient object detection (SOD). VDT SOD leverages the complementary information in RGB, depth, and thermal images to improve detection, but low-quality depth or thermal inputs can degrade overall performance, especially in challenging scenes.

To address this, QSF-Net comprises three subnets: initial feature extraction, quality-aware region selection, and region-guided selective fusion. The initial feature extraction subnet generates a preliminary prediction map for each modality using a shrinkage pyramid architecture equipped with multi-scale fusion (MSF) modules. The quality-aware region selection subnet analyzes these preliminary predictions to identify high-quality and low-quality regions, from which pseudo ground truths are generated for training. The region-guided selective fusion subnet then purifies and fuses the triple-modal features under the guidance of the quality-aware maps, refining the final saliency maps with intra- and inter-modality attention (IIA) and edge refinement (ER) modules. Extensive experiments on the VDT-2048 dataset show that QSF-Net outperforms state-of-the-art methods on multiple evaluation metrics, including S-measure, MAE, F-measure, and E-measure, and that it remains robust in challenging scenarios such as low illumination, similar foreground-background appearance, and background interference.
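To make the quality-aware region selection idea concrete, here is a minimal numpy sketch of one plausible criterion: mark a pixel as "high quality" where a modality's binarized preliminary prediction agrees with the (pseudo) ground truth, and "low quality" where it disagrees. The function name, the threshold `tau`, and the agreement rule are illustrative assumptions; the abstract does not specify the paper's exact selection criterion.

```python
import numpy as np

def quality_masks(pred, gt, tau=0.5):
    """Split an image into high-/low-quality regions by comparing a
    preliminary prediction map against the (pseudo) ground truth.

    NOTE: illustrative sketch only -- the threshold tau and the
    per-pixel agreement rule are assumptions, not the paper's method.
    """
    pred_bin = (pred >= tau).astype(np.uint8)   # binarize prediction
    gt_bin = (gt >= tau).astype(np.uint8)       # binarize ground truth
    high = (pred_bin == gt_bin).astype(np.uint8)  # agreement -> reliable region
    low = 1 - high                                # disagreement -> unreliable region
    return high, low

# Toy 2x2 example: the prediction agrees with the ground truth on three pixels.
pred = np.array([[0.9, 0.2],
                 [0.8, 0.7]])
gt = np.array([[1.0, 0.0],
               [1.0, 0.0]])
high, low = quality_masks(pred, gt)
# high marks the three agreeing pixels; low marks the single disagreeing one.
```

In the network itself such masks would be predicted by a learned subnet (trained against pseudo ground truths built this way) rather than computed directly, so that the region-guided fusion subnet can down-weight low-quality regions of the depth and thermal features at test time.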