More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification

2020 | Danfeng Hong, Member, IEEE, Lianru Gao, Senior Member, IEEE, Naoto Yokoya, Member, IEEE, Jing Yao, Jocelyn Chanussot, Fellow, IEEE, Qian Du, Fellow, IEEE, and Bing Zhang, Fellow, IEEE
This paper presents a general multimodal deep learning (MDL) framework for remote sensing (RS) imagery classification, addressing the challenge of using diverse and complementary data sources to improve classification accuracy. The framework, named MDL-RS, consists of two subnetworks: the Extraction Network (Ex-Net) and the Fusion Network (Fu-Net). Ex-Net extracts hierarchical representations from each modality, while Fu-Net fuses these representations to enhance classification performance. The paper investigates various fusion strategies, including early, middle, late, encoder-decoder (En-De), and cross fusion, focusing on "what," "where," and "how" to fuse information. The MDL-RS framework is evaluated on two multimodal RS datasets: the hyperspectral-LiDAR (HS-LiDAR) Houston2013 dataset and the multispectral-SAR (MS-SAR) local climate zone (LCZ) dataset, demonstrating superior performance compared to single-modal approaches and other fusion methods. The results show that compactness-based fusion strategies, such as cross fusion, outperform concatenation-based methods, especially in cross-modality learning (CML) scenarios. The paper also highlights the importance of spatial-spectral information in improving classification accuracy, particularly on challenging datasets such as LCZ. Overall, the MDL-RS framework provides a robust solution for pixel-level RS image classification using multimodal data, with potential applications in urban planning, forest monitoring, and disaster response.
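
The summary above describes the two-subnetwork design only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of that idea: two modality-specific extraction networks (Ex-Net) followed by a fusion network (Fu-Net) that mixes the two feature streams rather than concatenating them, loosely in the spirit of the paper's cross fusion. The layer widths, the additive mixing step, and the toy input dimensions are assumptions made for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class ExNet(nn.Module):
    """Per-modality extraction subnetwork (hypothetical layer widths)."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)


class FuNetCross(nn.Module):
    """Fusion subnetwork with a simple cross-style mixing step:
    each modality's features are re-projected and summed before
    classification (an illustrative compactness-based fusion,
    not the authors' exact formulation)."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.proj_a = nn.Linear(hidden_dim, hidden_dim)
        self.proj_b = nn.Linear(hidden_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, fa, fb):
        # Mix the modality-specific features instead of concatenating them.
        mixed = self.proj_a(fa) + self.proj_b(fb)
        return self.classifier(mixed)


class MDLRS(nn.Module):
    """End-to-end pixel classifier over two modalities (e.g., HS + LiDAR)."""
    def __init__(self, dim_a, dim_b, num_classes, hidden_dim=64):
        super().__init__()
        self.ex_a = ExNet(dim_a, hidden_dim)
        self.ex_b = ExNet(dim_b, hidden_dim)
        self.fu = FuNetCross(hidden_dim, num_classes)

    def forward(self, xa, xb):
        return self.fu(self.ex_a(xa), self.ex_b(xb))


if __name__ == "__main__":
    # Toy pixel vectors loosely modeled on an HS-LiDAR scene:
    # 144 spectral bands, 1 elevation channel, 15 land-cover classes.
    model = MDLRS(dim_a=144, dim_b=1, num_classes=15)
    xa, xb = torch.randn(8, 144), torch.randn(8, 1)
    print(model(xa, xb).shape)  # torch.Size([8, 15])
```

Replacing the mixing step with a plain torch.cat of the two feature vectors would give a concatenation-based baseline, which the paper reports as less effective than compactness-based fusion, especially in cross-modality learning scenarios.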