CrossFuse is a novel cross-attention-based approach to infrared and visible image fusion that aims to enhance complementary information between modalities while reducing redundant features. A two-stage training strategy is introduced: two auto-encoders are first trained, one per modality, and the cross-attention mechanism (CAM) and decoder are then trained on top of them. The CAM integrates features from both modalities, emphasising complementary information and suppressing redundancy. Experimental results show that CrossFuse achieves state-of-the-art performance compared with existing fusion networks.
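A minimal sketch of this two-stage strategy, assuming a PyTorch-style implementation, is given below: stage 1 trains one auto-encoder per modality on reconstruction, and stage 2 trains the CAM and fusion decoder on the pretrained encoders. Freezing the encoders in stage 2, the module names (ConvEncoder, ConvDecoder), and all hyper-parameters are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Illustrative single-modality encoder (not the paper's exact architecture)."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class ConvDecoder(nn.Module):
    """Illustrative decoder mapping features back to a single-channel image."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, f):
        return self.net(f)

def train_stage1(encoder, decoder, loader, epochs=1):
    """Stage 1: train one auto-encoder per modality with a reconstruction loss."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    for _ in range(epochs):
        for x in loader:  # x: (B, 1, H, W) images of one modality
            recon = decoder(encoder(x))
            loss = nn.functional.mse_loss(recon, x)
            opt.zero_grad(); loss.backward(); opt.step()

def train_stage2(enc_ir, enc_vis, cam, fuse_decoder, loader, fusion_loss, epochs=1):
    """Stage 2: keep the pretrained encoders fixed (assumed) and train CAM + decoder."""
    for p in list(enc_ir.parameters()) + list(enc_vis.parameters()):
        p.requires_grad = False
    opt = torch.optim.Adam(list(cam.parameters()) + list(fuse_decoder.parameters()), lr=1e-4)
    for _ in range(epochs):
        for ir, vis in loader:  # paired infrared / visible images
            fused = fuse_decoder(cam(enc_ir(ir), enc_vis(vis)))
            loss = fusion_loss(fused, ir, vis)
            opt.zero_grad(); loss.backward(); opt.step()
```

Here `fusion_loss` is a placeholder for the attention-based loss described below, and `cam` stands in for the cross-attention mechanism.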
The proposed method uses a cross-attention mechanism that enhances the intra-features of each modality as well as the inter-features (complementary information) between modalities. The CAM comprises self-attention (SA) and cross-attention (CA) blocks, with a novel reversed softmax operation that emphasises complementary information. The decoder uses skip connections to preserve detailed and salient features from the source images. A novel attention-based loss function combining intensity and gradient terms is introduced to train the network.
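The sketch below illustrates, under stated assumptions, how such an SA + CA block and loss could look: the "reversed softmax" is interpreted as a softmax over negated similarity scores (so weakly correlated, i.e. complementary, positions receive larger weights), and queries in the CA step are assumed to come from the other modality. Head count, normalisation, skip-connection wiring, and the paper's exact attention-based weighting of the loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reversed_softmax_attention(q, k, v):
    """Attention with a 'reversed' softmax: scores are negated before the softmax,
    so positions that are less correlated with the query receive more weight."""
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, T, T)
    attn = torch.softmax(-scores, dim=-1)                  # negation = assumed "reversed" softmax
    return torch.matmul(attn, v)

class CrossAttentionBlock(nn.Module):
    """Illustrative SA + CA block on flattened feature tokens of shape (B, T, dim)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_self, x_other):
        # Self-attention (standard softmax) refines the modality's own (intra) features.
        scale = x_self.shape[-1] ** -0.5
        sa_scores = self.q(x_self) @ self.k(x_self).transpose(-2, -1) * scale
        sa = torch.softmax(sa_scores, dim=-1) @ self.v(x_self)
        x = self.norm(x_self + sa)
        # Cross-attention with reversed softmax pulls in complementary (inter) features.
        ca = reversed_softmax_attention(self.q(x_other), self.k(x), self.v(x))
        return self.norm(x + ca)

def fusion_loss(fused, ir, vis, gamma=10.0):
    """Sketch of an intensity + gradient loss; gamma and the max-based targets are
    placeholder choices, not the paper's attention-derived weighting."""
    intensity = F.l1_loss(fused, torch.maximum(ir, vis))

    def grads(x):  # simple finite-difference gradients
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    fgx, fgy = grads(fused)
    igx, igy = grads(ir)
    vgx, vgy = grads(vis)
    gradient = (F.l1_loss(fgx.abs(), torch.maximum(igx.abs(), vgx.abs())) +
                F.l1_loss(fgy.abs(), torch.maximum(igy.abs(), vgy.abs())))
    return intensity + gamma * gradient
```

In the two-stage sketch above, a `CrossAttentionBlock` (wrapped to flatten feature maps into tokens) would play the role of `cam`, and `fusion_loss` would be the loss passed to `train_stage2`.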
The method is evaluated on the public TNO and VOT-RGBT datasets, where it shows superior performance in both visual quality and objective metrics. The results demonstrate that CrossFuse effectively enhances complementary information and preserves detailed features, outperforming existing fusion methods, and that the two-stage training strategy is more efficient and effective than a one-stage approach. As a hybrid CNN-transformer fusion network, CrossFuse addresses a limitation of existing transformer-based methods by focusing on complementary information rather than correlation. The method is expected to be effective in various applications, including medical diagnosis, surveillance, remote sensing, and robotics.