The paper introduces msGFM, a multisensor geospatial foundation model that integrates data from four sensor modalities: RGB images, Sentinel-2, SAR (synthetic aperture radar), and DSM (digital surface model). msGFM handles both paired and unpaired sensor data through a cross-sensor pre-training approach based on masked image modeling. The model is trained on a dataset of over 2 million images and performs strongly on downstream tasks including scene classification, segmentation, cloud removal, and pan-sharpening. The research highlights the limitations of existing representations in the geospatial domain and provides a guide for developing more advanced multisensor geospatial pre-training models. Key contributions include a novel cross-sensor pre-training paradigm, a high-performing pre-trained model, and an analysis of the model's effectiveness with practical insights. The paper also discusses the importance of handling multisensor heterogeneity during pre-training and the benefits of pre-training on multiple sensor modalities rather than on a single sensor.