OmniSat: Self-Supervised Modality Fusion for Earth Observation

17 Jul 2024 | Guillaume Astruc, Nicolas Gonthier, Clément Mallet, and Loïc Landrieu
OmniSat is a novel self-supervised model that fuses diverse Earth Observation (EO) modalities into rich features without labels by leveraging their spatial alignment. It is designed to handle multiple EO data types, including very high-resolution (VHR) images, optical and SAR time series, and other modalities.

To evaluate its effectiveness, two new multimodal datasets were created by augmenting existing ones with additional modalities. These datasets allow assessment of OmniSat's ability to process an arbitrary number of inputs with varying resolutions and natures. OmniSat delivers strong performance on three downstream tasks: forestry, land cover classification, and crop mapping. It achieves state-of-the-art results in both semi- and fully supervised settings, and even when only one modality is available at inference. The multimodal pretraining scheme improves performance by learning rich, cross-modal representations without relying on annotations.

The architecture of OmniSat first tokenizes each modality so that the resulting tokens are spatially aligned, then processes them with dedicated encoder-decoder structures, one per modality. A modality-combining network integrates these per-modality representations into a unified multimodal representation. Training combines a contrastive objective, which encourages spatial consistency across modalities, with a reconstruction objective, which ensures that the multimodal representation can be accurately decoded back into each modality. This self-supervised scheme lets OmniSat learn expressive features without labels, making it well suited to applications where data annotation is limited.
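To make the fusion scheme described above concrete, here is a minimal PyTorch-style sketch, not the authors' code: module names, dimensions, the fusion-by-averaging step, and the two hypothetical modalities ("vhr" and "s2_series") are illustrative assumptions. It shows per-modality encoders over spatially aligned patch tokens, a small transformer acting as the modality-combining network, per-modality decoders, and a combined contrastive-plus-reconstruction training loss.

```python
# Hedged sketch (not the official OmniSat implementation): illustrates aligned
# tokenization, per-modality encoding, multimodal fusion, and the two objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Projects one modality's spatially aligned patch tokens into a shared dimension."""
    def __init__(self, in_dim: int, dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):                 # tokens: (batch, patches, in_dim)
        return self.proj(tokens)               # (batch, patches, dim)


class OmniFusion(nn.Module):
    """Encodes each modality, fuses the aligned tokens, and decodes every modality back."""
    def __init__(self, in_dims: dict, dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict({m: ModalityEncoder(d, dim) for m, d in in_dims.items()})
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.combiner = nn.TransformerEncoder(layer, num_layers=2)   # modality-combining network
        self.decoders = nn.ModuleDict({m: nn.Linear(dim, d) for m, d in in_dims.items()})

    def forward(self, inputs: dict):
        # All modalities are tokenized onto the same patch grid, so they can be fused per patch.
        encoded = {m: self.encoders[m](x) for m, x in inputs.items()}
        fused = self.combiner(torch.stack(list(encoded.values()), 0).mean(0))
        recon = {m: self.decoders[m](fused) for m in inputs}
        return encoded, fused, recon


def contrastive_loss(encoded: dict, temperature: float = 0.1):
    """Pulls together tokens of different modalities that cover the same patch."""
    mods, loss, count = list(encoded.values()), 0.0, 0
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            a = F.normalize(mods[i].flatten(0, 1), dim=-1)   # (batch*patches, dim)
            b = F.normalize(mods[j].flatten(0, 1), dim=-1)
            logits = a @ b.t() / temperature
            target = torch.arange(a.size(0), device=a.device)  # matching patches on the diagonal
            loss = loss + F.cross_entropy(logits, target)
            count += 1
    return loss / max(count, 1)


# Toy usage: two hypothetical modalities tokenized onto the same 4x4 patch grid.
model = OmniFusion({"vhr": 768, "s2_series": 128})
batch = {"vhr": torch.randn(2, 16, 768), "s2_series": torch.randn(2, 16, 128)}
encoded, fused, recon = model(batch)
loss = contrastive_loss(encoded) + sum(
    F.mse_loss(recon[m], batch[m]) for m in batch    # reconstruction objective
)
loss.backward()
```

The key design point this sketch mirrors is that spatial alignment of tokens is what makes both objectives possible: the contrastive term can match same-patch tokens across modalities, and the reconstruction term forces the fused representation to retain enough information to rebuild each input modality.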
The model's ability to handle diverse EO data types, and its effectiveness in both self-supervised and supervised settings, highlight its potential for improving EO analysis. The code and datasets are available for further research and development.