This paper introduces msGFM, a multisensor geospatial foundation model that unifies data from four sensor modalities: SAR, Sentinel-2, RGB, and DSM. The model is trained on a dataset of two million multisensor images and handles both paired and unpaired sensor data. msGFM employs a cross-sensor pretraining approach within masked image modeling to synthesize joint representations from diverse sensors, and it demonstrates strong performance on downstream tasks including scene classification, segmentation, cloud removal, and pan-sharpening. A key finding is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, highlighting the limitations of existing representations in this field. The paper also examines how effectively features from established vision models can be leveraged or distilled for multisensor geospatial pretraining and addresses the challenge of mitigating multisensor heterogeneity during pretraining. The contributions include a novel cross-sensor paradigm for joint representation learning, a high-performing pretrained model trained on a comprehensive multisensor dataset, and a thorough analysis of the model's performance. The results show that msGFM outperforms existing models on several downstream tasks, demonstrating the benefits of multisensor pretraining, and they underscore the importance of pretraining from scratch, since distillation methods face a significant domain gap between natural images and geospatial-specific sensors.
The study concludes that multisensor pretraining is crucial for improving geospatial task performance and that further research is needed to address challenges such as incorporating temporal information into pretrained models.
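The cross-sensor masked-image-modeling idea, in which visible patches from one sensor are encoded by a shared backbone and masked patches of a paired sensor are reconstructed, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the class and function names (`CrossSensorMIM`, `cross_sensor_loss`), the hyperparameters, and the per-sensor channel counts are all assumptions for illustration, and details such as the actual masking strategy and decoder design are omitted.

```python
import torch
import torch.nn as nn


class CrossSensorMIM(nn.Module):
    """Minimal sketch of cross-sensor masked image modeling.

    Each sensor has its own patch embedding; a shared transformer encoder
    produces a joint representation from the (partially masked) patches of
    one sensor, and a sensor-specific head reconstructs the patches of a
    paired sensor (e.g., encode RGB, reconstruct SAR). All names and
    hyperparameters here are illustrative, not taken from the paper.
    """

    def __init__(self, sensor_channels, img_size=192, patch_size=16,
                 dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Sensor-specific patch embeddings, e.g. {"rgb": 3, "sar": 2, "dsm": 1}.
        self.embed = nn.ModuleDict({
            name: nn.Conv2d(c, dim, kernel_size=patch_size, stride=patch_size)
            for name, c in sensor_channels.items()
        })
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)  # shared across sensors
        # Sensor-specific reconstruction heads (per-patch pixel regression).
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, c * patch_size ** 2)
            for name, c in sensor_channels.items()
        })

    def forward(self, x, src, tgt, mask_ratio=0.6):
        """x: [B, C_src, H, W] image from sensor `src`; reconstruct sensor `tgt`."""
        tokens = self.embed[src](x).flatten(2).transpose(1, 2) + self.pos  # [B, N, D]
        B, N, D = tokens.shape
        # Randomly mask a fraction of patches and replace them with the mask token.
        mask = torch.rand(B, N, device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, D), tokens)
        latent = self.encoder(tokens)
        pred = self.heads[tgt](latent)  # [B, N, C_tgt * P * P]
        return pred, mask


def cross_sensor_loss(pred, target_patches, mask):
    """L2 reconstruction loss computed only on masked patches."""
    per_patch = ((pred - target_patches) ** 2).mean(dim=-1)  # [B, N]
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


# Usage sketch: encode the RGB view, reconstruct the paired SAR view.
model = CrossSensorMIM({"rgb": 3, "sar": 2, "dsm": 1})
rgb = torch.randn(4, 3, 192, 192)
sar_patches = torch.randn(4, 144, 2 * 16 * 16)  # patchified SAR target
pred, mask = model(rgb, src="rgb", tgt="sar")
loss = cross_sensor_loss(pred, sar_patches, mask)
```

Under this reading, the shared encoder is what ties the sensors together into a joint representation, while the sensor-specific embeddings and heads are the mechanism for coping with multisensor heterogeneity; the actual architecture and training recipe in the paper may differ.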