This paper proposes a general multimodal deep learning (MDL) framework, named MDL-RS, for remote sensing (RS) imagery classification, aiming to overcome the limitations of single-modality approaches by leveraging multiple data sources. The framework integrates feature-extraction and fusion modules to enhance classification performance, and it addresses three key questions: "what" to fuse, "where" to fuse, and "how" to fuse. It supports both the common multi-modality learning (MML) scenario and the special cross-modality learning (CML) scenario, in which the modalities available at training and test time differ. Five fusion strategies are investigated: early fusion, middle fusion, late fusion, encoder-decoder (En-De) fusion, and cross fusion, of which cross fusion proves particularly effective at transferring information across modalities. The framework accommodates both pixel-wise classification and spatial-spectral modeling with convolutional neural networks (CNNs). Validated on two multimodal RS datasets, MDL-RS outperforms single-modality baselines; the results further show that compactness-based fusion strategies outperform concatenation-based ones, especially in CML tasks, and that the framework is robust to various image degradations and effective at transferring knowledge across modalities. The paper highlights the value of diverse data sources for improving classification accuracy and provides a comprehensive analysis of fusion strategies for RS image classification.
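
Since the contrast between concatenation-based and compactness-based fusion is central to the reported results, a minimal sketch may help. The PyTorch code below pairs a middle-fusion network (the two modalities' features concatenated mid-network) with a cross-fusion network (features exchanged between the two streams, so a single-modality input can still be classified, as in the CML setting). The module layout, layer sizes, and missing-modality fallback are illustrative assumptions, not the authors' implementation, which uses deeper pixel-wise networks and CNN backbones.

```python
import torch
import torch.nn as nn


class MiddleFusion(nn.Module):
    """Concatenation-based fusion: per-modality encoders, features
    concatenated in the middle of the network (illustrative sketch)."""

    def __init__(self, dim_a, dim_b, hidden, n_classes):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x_a, x_b):
        # Both modalities are required: their features are simply stacked.
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))


class CrossFusion(nn.Module):
    """Compactness-based cross fusion: each stream receives the other
    stream's features, and a shared head classifies the fused result."""

    def __init__(self, dim_a, dim_b, hidden, n_classes):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, hidden)
        self.enc_b = nn.Linear(dim_b, hidden)
        self.cross_a = nn.Linear(hidden, hidden)
        self.cross_b = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x_a, x_b=None):
        h_a = torch.relu(self.enc_a(x_a))
        # Assumed CML fallback: if the second modality is absent at test
        # time, reuse the first stream's features in its place.
        h_b = torch.relu(self.enc_b(x_b)) if x_b is not None else h_a
        # Cross-over: each branch mixes in the other branch's features.
        z_a = torch.relu(self.cross_a(h_a) + self.cross_b(h_b))
        z_b = torch.relu(self.cross_a(h_b) + self.cross_b(h_a))
        return self.head(z_a + z_b)


# Hypothetical pixel-wise inputs, e.g., a 144-band HSI vector and a
# 21-channel LiDAR feature vector per pixel (dimensions are made up).
model = CrossFusion(dim_a=144, dim_b=21, hidden=64, n_classes=15)
logits_mml = model(torch.randn(8, 144), torch.randn(8, 21))  # MML: both modalities
logits_cml = model(torch.randn(8, 144))                      # CML: one modality
```

Under this sketch, the cross-fusion model still produces predictions from a single modality in the CML setting, whereas the concatenation-based design cannot do so without a placeholder input, which is consistent with the paper's finding that compactness-based strategies fare better in CML tasks.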