CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

June 18, 2024 | Hui Li, Xiao-Jun Wu
CrossFuse is a cross-attention-based approach to infrared and visible image fusion. The method aims to enhance complementary information between modalities while reducing redundant features. A two-stage training strategy is introduced: two auto-encoders are first trained, one per modality, followed by training of the cross-attention mechanism (CAM) and the decoder.

The CAM combines self-attention (SA) blocks, which enhance the intra-features of each modality, with cross-attention (CA) blocks, which enhance the inter-features (complementary information) between modalities. A novel reversed softmax operation in the CA blocks emphasizes complementary rather than correlated information. The decoder uses skip connections to preserve detailed and salient features from the source images, and the network is trained with a novel attention-based loss function that combines intensity and gradient components.

Evaluated on the public TNO and VOT-RGBT datasets, CrossFuse achieves state-of-the-art performance against existing fusion networks in both visual quality and objective metrics, effectively enhancing complementary information while preserving detailed features. The two-stage training strategy proves more efficient and effective than a one-stage approach. As a hybrid CNN- and transformer-based fusion network, the method addresses a limitation of existing transformer-based methods by focusing on complementary information rather than correlation, and is expected to be useful in applications such as medical diagnosis, surveillance, remote sensing, and robotics.
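The reversed softmax idea can be illustrated with a small sketch. This is a hypothetical NumPy reading of the mechanism, not the paper's implementation: attention scores are negated before the softmax, so token pairs with low correlation (complementary information) receive high attention weight. The function names and toy dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reversed_cross_attention(q_feat, kv_feat, d_k):
    # Hypothetical sketch of the reversed softmax: scores are negated
    # before normalization, so LOW-correlation (complementary) pairs
    # get HIGH attention weight, unlike standard cross-attention.
    scores = q_feat @ kv_feat.T / np.sqrt(d_k)
    attn = softmax(-scores, axis=-1)  # "reversed" softmax
    return attn @ kv_feat, attn

# toy example: 4 infrared tokens attend to 4 visible tokens, dim 8
rng = np.random.default_rng(0)
ir_tokens = rng.normal(size=(4, 8))
vis_tokens = rng.normal(size=(4, 8))
fused, attn = reversed_cross_attention(ir_tokens, vis_tokens, d_k=8)
```

Each attention row still sums to one; only the ranking of the weights is inverted relative to standard attention.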
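The intensity-plus-gradient loss can be sketched in the same spirit. This is a minimal illustrative version under stated assumptions: the intensity term pulls the fused image toward the per-pixel maximum of the sources, and the gradient term preserves the strongest edges; the attention-based weighting described in the summary is omitted, and `grad_mag` is a simple finite-difference stand-in for the paper's gradient operator.

```python
import numpy as np

def grad_mag(img):
    # finite-difference gradient magnitude (L1 combination of
    # vertical and horizontal derivatives)
    gy, gx = np.gradient(img.astype(float))
    return np.abs(gx) + np.abs(gy)

def fusion_loss(fused, ir, vis, alpha=1.0):
    # Sketch of an intensity + gradient fusion loss (attention
    # weighting from the paper is not modeled here).
    intensity = np.mean((fused - np.maximum(ir, vis)) ** 2)
    gradient = np.mean(np.abs(
        grad_mag(fused) - np.maximum(grad_mag(ir), grad_mag(vis))
    ))
    return intensity + alpha * gradient
```

If the fused image equals both sources, both terms vanish and the loss is zero, which is the expected fixed point of such a formulation.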