1 March 2024 | Wenjian Zhang, Zheng Tan, Qunbo Lv, Jiaao Li, Baoyu Zhu and Yangyang Liu
This paper proposes EHNet, an efficient hybrid CNN-Transformer approach for remote sensing image super-resolution that combines a lightweight convolution module with an improved Swin Transformer in a UNet-like architecture. The encoder uses a novel Lightweight Feature Extraction Block (LFEB) that employs depthwise convolution and a Cross Stage Partial structure to extract rich features at low computational cost. The decoder incorporates a sequence-based upsample block (SUB) that focuses on semantic information through a multi-layer perceptron (MLP) layer, improving feature expression, detail recovery, and reconstruction accuracy. With only 2.64 million parameters, EHNet balances reconstruction quality against computational demands. Experiments on the UCMerced and AID datasets show state-of-the-art performance, with PSNR values of 28.02 dB and 29.44 dB, respectively, and results that surpass existing methods in PSNR, SSIM, and visual quality. Evaluations on both natural and remote sensing image super-resolution tasks further demonstrate the model's effectiveness across scenarios, making its architecture well suited to applications that require high-quality image reconstruction.
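To make the two building blocks concrete, below is a minimal PyTorch sketch of what an LFEB (depthwise convolution inside a CSP-style channel split) and a SUB (an MLP operating on the token sequence before a pixel-shuffle rearrangement) could look like. This is not the authors' reference implementation: the kernel sizes, channel split ratio, MLP expansion ratio, and upsampling factor are illustrative assumptions; only the high-level structure follows the description above.

```python
# Hypothetical sketch of LFEB and SUB; structural details are assumptions,
# not the published EHNet code.
import torch
import torch.nn as nn


class LFEB(nn.Module):
    """Lightweight Feature Extraction Block: depthwise + pointwise convolution
    applied to half the channels, CSP-style, then re-fused with the rest."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half),  # depthwise
            nn.Conv2d(half, half, kernel_size=1),                          # pointwise
            nn.GELU(),
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)          # CSP split: only `a` goes through the branch
        return self.fuse(torch.cat([self.branch(a), b], dim=1)) + x


class SUB(nn.Module):
    """Sequence-based Upsample Block: spatial positions are treated as tokens,
    an MLP expands each token, and PixelShuffle rearranges the result into a
    higher-resolution feature map."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        hidden = channels * 2             # assumed MLP expansion ratio
        self.mlp = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, hidden),
            nn.GELU(),
            nn.Linear(hidden, channels * scale * scale),
        )
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = self.mlp(tokens)                        # (B, H*W, C*scale^2)
        feat = tokens.transpose(1, 2).reshape(b, -1, h, w)
        return self.shuffle(feat)                        # (B, C, scale*H, scale*W)


if __name__ == "__main__":
    x = torch.randn(1, 64, 16, 16)
    y = SUB(64, scale=2)(LFEB(64)(x))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The sketch illustrates why the design stays lightweight: the depthwise/CSP split keeps the convolutional cost low in the encoder, while the SUB's MLP works on per-token features rather than full feature maps before the inexpensive pixel-shuffle step.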