13 May 2024 | Liuxin Bao, Xiaofei Zhou*, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu*, Jiyong Zhang, and Chenggang Yan
This paper proposes a Quality-aware Selective Fusion Network (QS-Net) for Visible-Depth-Thermal (VDT) salient object detection. The network comprises three subnets: initial feature extraction, quality-aware region selection, and region-guided selective fusion. The initial feature extraction subnet extracts features from the RGB, depth, and thermal images using a shrinkage pyramid architecture equipped with multi-scale fusion (MSF) modules. The quality-aware region selection subnet generates quality-aware maps that identify high-quality and low-quality regions in each modality; these maps are used to supervise training. The region-guided selective fusion subnet purifies the initial features and fuses them using intra-modality and inter-modality attention (IIA) modules and edge refinement (ER) modules. Evaluated on the VDT-2048 dataset, QS-Net outperforms 13 state-of-the-art methods across multiple evaluation metrics. Thanks to its quality-aware strategy, the model remains robust in challenging scenarios where one modality, such as the depth or thermal image, is of low quality. It is also computationally efficient, with the fewest FLOPs and the highest FPS among the compared models. These results demonstrate the effectiveness and superiority of the proposed QS-Net for VDT salient object detection.
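The core idea of quality-aware selective fusion can be sketched in a few lines: a per-modality quality estimate gates how much each of the RGB, depth, and thermal streams contributes to the fused feature, so a low-quality modality is suppressed rather than averaged in blindly. The function names and the softmax gating below are illustrative assumptions for intuition, not the paper's exact module design.

```python
import math

def softmax(scores):
    """Normalize raw quality scores into fusion weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def quality_aware_fusion(features, quality_scores):
    """Fuse per-modality feature vectors, weighting each modality by its
    estimated quality.  `features` maps modality name -> list[float];
    `quality_scores` maps modality name -> float.  Both the gating scheme
    and the names here are hypothetical, not the paper's formulation."""
    modalities = list(features)
    weights = softmax([quality_scores[m] for m in modalities])
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for w, m in zip(weights, modalities):
        for i, v in enumerate(features[m]):
            fused[i] += w * v
    return fused

# Example: the thermal stream has a low estimated quality score,
# so its (here deliberately outlying) features contribute the least.
feats = {"rgb": [1.0, 2.0], "depth": [0.5, 1.5], "thermal": [4.0, 4.0]}
quality = {"rgb": 2.0, "depth": 1.0, "thermal": -1.0}
fused = quality_aware_fusion(feats, quality)
```

In a real network the quality scores would themselves be predicted (e.g., from the quality-aware maps) and the weighting would operate on spatial feature maps per region, but the gating principle is the same.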