29 Jul 2024 | Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, and Nico Lang
MMEarth is a global multi-modal pretraining dataset for geospatial representation learning, covering 1.2 million locations with 12 aligned modalities per location. It includes both pixel-level and image-level data, such as optical satellite images, SAR data, elevation maps, and scalar values like biome and temperature. The dataset is designed to enable the learning of general-purpose representations for optical satellite images, particularly from the Sentinel-2 mission, which are useful for various downstream tasks including image classification and semantic segmentation.
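To make the structure of the dataset concrete, the sketch below shows what a single multi-modal training example could look like. The modality names, array shapes, and values here are illustrative assumptions, not MMEarth's actual schema.

```python
# A minimal sketch of one multi-modal training example.
# Modality names and shapes are illustrative assumptions.
import numpy as np

sample = {
    # Pixel-level modalities: (channels, height, width) rasters
    "sentinel2": np.zeros((12, 128, 128), dtype=np.float32),  # optical bands
    "sentinel1": np.zeros((2, 128, 128), dtype=np.float32),   # SAR (VV, VH)
    "elevation": np.zeros((1, 128, 128), dtype=np.float32),   # DEM
    # Image-level modalities: one value per location
    "biome": 4,                 # categorical class index
    "mean_temperature": 12.7,   # scalar, degrees Celsius
    "lat_lon": (55.68, 12.57),  # geolocation of the patch
}
```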
The paper proposes a Multi-Pretext Masked Autoencoder (MP-MAE) approach, which extends the ConvNeXt V2 architecture with multi-modal pretext tasks. This approach outperforms both MAEs pretrained on ImageNet and those pretrained on domain-specific satellite images. The MP-MAE approach improves both fine-tuning and linear probing performance, with linear probing benefiting particularly from the multi-modal pretext tasks. The results show that pretraining with multi-modal pretext tasks leads to better label efficiency and parameter efficiency, which are crucial for global-scale applications.
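The following PyTorch sketch illustrates the multi-pretext idea at a high level: a shared encoder consumes the masked Sentinel-2 input, and a lightweight head per target modality contributes a reconstruction loss. The module names, dimensions, and the simple convolutional stand-in for the encoder are assumptions; the paper's actual model is a ConvNeXt V2 backbone with masked (sparse) convolutions.

```python
# Sketch of a multi-pretext masked autoencoder: one shared encoder,
# one reconstruction head per target modality, losses summed.
import torch
import torch.nn as nn

class MultiPretextMAE(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Stand-in encoder that patchifies the 12-band Sentinel-2 input;
        # the actual model is a ConvNeXt V2 backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(12, embed_dim, kernel_size=16, stride=16),
            nn.GELU(),
        )
        # One head per pretext modality, each upsampling the shared
        # features back to the resolution of its reconstruction target.
        self.heads = nn.ModuleDict({
            "sentinel2": nn.ConvTranspose2d(embed_dim, 12, kernel_size=16, stride=16),
            "sentinel1": nn.ConvTranspose2d(embed_dim, 2, kernel_size=16, stride=16),
            "elevation": nn.ConvTranspose2d(embed_dim, 1, kernel_size=16, stride=16),
        })

    def forward(self, s2_masked: torch.Tensor) -> dict:
        z = self.encoder(s2_masked)  # shared representation
        return {name: head(z) for name, head in self.heads.items()}

def multi_pretext_loss(preds, targets, weights=None):
    """Sum per-modality reconstruction losses (uniform weights by default)."""
    total = torch.zeros(())
    for name, pred in preds.items():
        w = 1.0 if weights is None else weights[name]
        total = total + w * nn.functional.mse_loss(pred, targets[name])
    return total

# Usage with random stand-in data:
model = MultiPretextMAE()
x = torch.randn(4, 12, 128, 128)  # a (masked) Sentinel-2 batch
targets = {
    "sentinel2": torch.randn(4, 12, 128, 128),
    "sentinel1": torch.randn(4, 2, 128, 128),
    "elevation": torch.randn(4, 1, 128, 128),
}
loss = multi_pretext_loss(model(x), targets)
```

The design point the sketch captures is that all pretext tasks share one encoder, so gradients from every modality shape a single representation of the optical input.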
The dataset is used to explore the potential of multi-modal data for improving representation learning. The MP-MAE approach is evaluated on several downstream tasks, including image classification and semantic segmentation, and shows improved performance over the MAE baselines above. The results also demonstrate that multi-modal pretext tasks improve the generalization of representations to new downstream tasks not known at pretraining time.
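As a rough illustration of the linear probing protocol under which the multi-modal pretext tasks help most, the sketch below freezes a pretrained encoder and trains only a linear classifier on the downstream labels. The `pretrained_encoder`, feature dimension, and class count are placeholders, not values from the paper.

```python
# Minimal linear probing sketch: freeze the encoder, train only a
# linear head. Names and sizes are illustrative assumptions.
import torch.nn as nn

def build_linear_probe(pretrained_encoder: nn.Module,
                       feat_dim: int = 512,
                       num_classes: int = 10) -> nn.Module:
    # Freeze all encoder weights so only the probe receives gradients.
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(
        pretrained_encoder,
        nn.AdaptiveAvgPool2d(1),   # pool spatial features to one vector
        nn.Flatten(),
        nn.Linear(feat_dim, num_classes),  # the only trainable layer
    )
```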
The paper also situates the work within related research on masked image modelling and multi-modal representation learning, highlighting the value of multi-modal data for learning better representations. Overall, the study shows that multi-modal data can be used effectively for representation learning in Earth observation, and that MP-MAE learns representations that transfer to a wide range of downstream tasks, including crop type, land cover, and climate zone classification.