A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

2024 | Xinyu Wang, Le Sun, Chuhan Lu, Baozhu Li
The paper introduces TNCCA (Transformer Network with CNN-Enhanced Cross-Attention), a dual-branch deep learning model for hyperspectral image (HSI) classification. Existing methods typically extract spatial-spectral features from HSI data of a single size; TNCCA addresses this limitation by exploiting multi-scale feature information. The model has two main components: a multi-scale shallow feature extraction module and a transformer module with CNN-enhanced cross-attention. The shallow feature extraction module takes HSI input data at different scales and extracts shallow spatial-spectral features with a multi-scale hybrid 3D and 2D convolutional neural network. The transformer module uses 2D convolutions and dilated convolutions to generate Q, K, and V tokens at different scales, which lets the model explore and fuse multi-scale features from both branches. Experiments on three widely used HSI datasets (Houston2013, Trento, and Pavia University) show that TNCCA outperforms state-of-the-art methods in classification accuracy, even with limited training samples. Ablation studies and an inference speed analysis further support the model's robustness and efficiency.
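
The idea of generating Q, K, and V tokens with convolutions at different scales and fusing the two branches through cross-attention can be illustrated with a small sketch. This is not the authors' implementation. It assumes that one branch supplies Q through an ordinary 3 x 3 convolution while the other branch supplies K and V through dilated 3 x 3 convolutions, that the convolutions are depthwise with fixed (not learned) kernels, and that the toy patch sizes (5 x 5 and 9 x 9, 4 channels) stand in for real shallow features. All function names are hypothetical, and only the Python standard library is used.

```python
import math

def depthwise_conv2d(x, kernel, dilation=1):
    """Valid 2D cross-correlation applied to each channel independently.

    x is a list of C channels, each an H x W list of lists; kernel is k x k.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1  # receptive field of the dilated kernel
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [
                sum(kernel[a][b] * ch[i + a * dilation][j + b * dilation]
                    for a in range(k) for b in range(k))
                for j in range(w - span + 1)
            ]
            for i in range(h - span + 1)
        ]
        for ch in x
    ]

def to_tokens(fmap):
    """Flatten a C x H x W map into H*W tokens of dimension C."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[ch][i][j] for ch in range(c)]
            for i in range(h) for j in range(w)]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v))
                    for c in range(d)])
    return out

def make_patch(channels, size):
    """Deterministic toy patch standing in for shallow features (assumption)."""
    return [[[math.sin(0.3 * c + 0.5 * i + 0.7 * j) for j in range(size)]
             for i in range(size)] for c in range(channels)]

small = make_patch(4, 5)  # hypothetical small-scale branch: 4 x 5 x 5
large = make_patch(4, 9)  # hypothetical large-scale branch: 4 x 9 x 9

# Fixed kernels stand in for learned convolution weights (assumption).
mean3 = [[1 / 9] * 3 for _ in range(3)]
center3 = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

q = to_tokens(depthwise_conv2d(small, mean3, dilation=1))    # 3 x 3 map
k = to_tokens(depthwise_conv2d(large, mean3, dilation=2))    # 5 x 5 map
v = to_tokens(depthwise_conv2d(large, center3, dilation=2))  # 5 x 5 map

fused = cross_attention(q, k, v)
print(len(q), len(k), len(fused), len(fused[0]))  # prints: 9 25 9 4
```

With dilation 2, the 3 x 3 kernel covers a 5 x 5 window, so the larger patch is summarized into 25 key and value tokens while the smaller patch yields 9 query tokens. Each fused token is a weighted average of the value tokens from the other branch, which is the sense in which cross-attention mixes information across scales in this sketch.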