The paper "CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach" by Hui Li and Xiao-Jun Wu introduces a novel cross-attention mechanism (CAM) for enhancing the complementary information between infrared and visible images. The authors address the challenge of integrating multi-sensor data into a single image, which often involves significant differences between infrared and visible modalities. Traditional cross-attention modules focus on correlation, but image fusion tasks require focusing on complementarity. To overcome this, the proposed CAM enhances the complementary information while reducing redundancy.
The method employs a two-stage training strategy: the first stage trains two auto-encoder networks, one per modality, sharing the same architecture; the second stage trains the CAM and a decoder. The CAM integrates features from both modalities into a single fused feature that emphasizes complementarity and reduces redundancy, and the decoder then generates the fused image from this feature.
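A minimal sketch of this two-stage schedule is given below. The toy encoder/decoder architectures, the concatenation-based stand-in for the CAM, the L1 reconstruction losses, the optimizer settings, and the dummy data loader are all assumptions for illustration; the paper's actual networks and loss functions differ.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the paper's networks (real architectures differ).
def make_encoder():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())

def make_decoder():
    return nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

class ConcatFusion(nn.Module):
    """Placeholder for the CAM: a 1x1 conv over concatenated features."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, f_ir, f_vis):
        return self.mix(torch.cat([f_ir, f_vis], dim=1))

# Dummy loader of aligned (infrared, visible) image pairs.
loader = [(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)) for _ in range(4)]
recon_loss = nn.L1Loss()  # assumed reconstruction loss

enc_ir, dec_ir = make_encoder(), make_decoder()
enc_vis, dec_vis = make_encoder(), make_decoder()

# Stage 1: train one auto-encoder per modality to reconstruct its input.
opt1 = torch.optim.Adam([*enc_ir.parameters(), *dec_ir.parameters(),
                         *enc_vis.parameters(), *dec_vis.parameters()], lr=1e-4)
for ir, vis in loader:
    loss = recon_loss(dec_ir(enc_ir(ir)), ir) + recon_loss(dec_vis(enc_vis(vis)), vis)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: freeze both encoders, train the fusion module and a fusion decoder.
for p in [*enc_ir.parameters(), *enc_vis.parameters()]:
    p.requires_grad_(False)
fusion, dec_fused = ConcatFusion(), make_decoder()
opt2 = torch.optim.Adam([*fusion.parameters(), *dec_fused.parameters()], lr=1e-4)
for ir, vis in loader:
    fused = dec_fused(fusion(enc_ir(ir), enc_vis(vis)))
    # Assumed surrogate objective: keep the fused image close to both inputs.
    loss = recon_loss(fused, ir) + recon_loss(fused, vis)
    opt2.zero_grad(); loss.backward(); opt2.step()
```

The design choice the sketch preserves is the one stated in the paper's description: the encoders are learned per modality first, and only the fusion module and final decoder are optimized in the second stage.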
Experimental results show that the proposed method achieves state-of-the-art performance compared with existing fusion networks. The method is evaluated on two datasets, TNO and VOT-RGBT, using several metrics, including entropy, standard deviation, mutual information, and feature-based mutual information. The results indicate that CrossFuse preserves more complementary information and better retains detail and salient features, making it a robust and efficient solution for multimodal image fusion.
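For reference, the sketch below shows standard definitions of three of the quality metrics mentioned above (entropy, standard deviation, and mutual information), computed from image histograms. These are common textbook formulations, not the paper's exact evaluation code; feature-based mutual information additionally applies MI to extracted feature maps (for example, gradients) and is omitted here.

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of a grayscale image in [0, 255]; higher means more information."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """Mutual information between two images, estimated from their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Example on random "images"; for fusion, MI is typically reported as
# MI(fused, infrared) + MI(fused, visible), and standard deviation as fused.std().
ir, vis, fused = (np.random.randint(0, 256, (64, 64)) for _ in range(3))
print(entropy(fused), fused.std(),
      mutual_information(fused, ir) + mutual_information(fused, vis))
```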