This paper presents a general multimodal deep learning (MDL) framework for remote sensing (RS) imagery classification, addressing the challenge of using diverse and complementary data sources to improve classification accuracy. The framework, named MDL-RS, consists of two subnetworks: the Extraction Network (Ex-Net) and the Fusion Network (Fu-Net). Ex-Net extracts hierarchical representations from different modalities, while Fu-Net fuses these representations to enhance classification performance. The paper investigates various fusion strategies, including early, middle, late, encoder-decoder (En-De), and cross fusion, focusing on "what," "where," and "how" to fuse information. The MDL-RS framework is evaluated on two multimodal RS datasets: HS-LiDAR Houston2013 and MS-SAR LCZ, demonstrating superior performance compared to single-modal approaches and other fusion methods. The results show that compactness-based fusion strategies, such as cross fusion, outperform concatenation-based methods, especially in cross-modality learning (CML) scenarios. The paper also highlights the importance of spatial-spectral information in improving classification accuracy, particularly in challenging datasets like LCZ. Overall, the MDL-RS framework provides a robust solution for pixel-level RS image classification using multimodal data, with potential applications in urban planning, forest monitoring, and disaster response.
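To make the Ex-Net/Fu-Net split and the difference between concatenation-based and compactness-based fusion concrete, the following is a minimal sketch (not the paper's exact implementation): two per-modality extraction branches feeding either a middle-fusion head (concatenation) or a cross-fusion head (cross-projected, additively combined features). Layer widths, the cross-fusion formulation, and the toy input sizes (144 hyperspectral bands, 1 LiDAR feature, 15 classes, loosely modeled on Houston2013) are illustrative assumptions.

```python
# Minimal sketch, assuming simple fully connected branches; not the authors' code.
import torch
import torch.nn as nn

class ExNet(nn.Module):
    """Per-modality extraction network: hierarchical fully connected layers."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class MiddleFusionNet(nn.Module):
    """Concatenation-based middle fusion: join the two feature vectors, then classify."""
    def __init__(self, dim_a, dim_b, hidden_dim=64, n_classes=15):
        super().__init__()
        self.ex_a = ExNet(dim_a, hidden_dim)
        self.ex_b = ExNet(dim_b, hidden_dim)
        self.fu = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, xa, xb):
        return self.fu(torch.cat([self.ex_a(xa), self.ex_b(xb)], dim=1))

class CrossFusionNet(nn.Module):
    """Compactness-based cross fusion (illustrative): each modality's features are
    projected with both modalities' projection layers and summed, so information is
    exchanged across branches instead of simply concatenated."""
    def __init__(self, dim_a, dim_b, hidden_dim=64, n_classes=15):
        super().__init__()
        self.ex_a = ExNet(dim_a, hidden_dim)
        self.ex_b = ExNet(dim_b, hidden_dim)
        self.proj_a = nn.Linear(hidden_dim, hidden_dim)
        self.proj_b = nn.Linear(hidden_dim, hidden_dim)
        self.fu = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, n_classes))

    def forward(self, xa, xb):
        ha, hb = self.ex_a(xa), self.ex_b(xb)
        # Cross terms: apply both projections to both modalities and fuse additively.
        fused = self.proj_a(ha) + self.proj_b(hb) + self.proj_a(hb) + self.proj_b(ha)
        return self.fu(fused)

if __name__ == "__main__":
    # Toy pixel-level batch: 144 hyperspectral bands and 1 LiDAR-derived feature per pixel.
    hsi, lidar = torch.randn(8, 144), torch.randn(8, 1)
    print(MiddleFusionNet(144, 1)(hsi, lidar).shape)  # torch.Size([8, 15])
    print(CrossFusionNet(144, 1)(hsi, lidar).shape)   # torch.Size([8, 15])
```

In this sketch, the cross-fusion head keeps the fused representation at the same width as a single branch (compactness) while still mixing the two modalities, which is one way to read why such strategies remain usable when one modality is missing at test time in cross-modality learning.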