7 Jun 2024 | Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, Xiao Xiang Zhu
This paper introduces DOFA, a neural plasticity-inspired multimodal foundation model for Earth observation (EO). DOFA borrows the concept of neural plasticity from brain science to integrate various data modalities into a single framework that adaptively processes diverse EO data. The model uses a dynamic hypernetwork that adjusts to different wavelengths, enabling a single versatile Transformer to be jointly trained on data from five sensors and to excel across 14 distinct EO tasks, including tasks involving sensors never seen during pretraining. DOFA's design offers a promising leap toward more accurate, efficient, and unified EO analysis, showcasing remarkable adaptability and performance in harnessing the potential of multimodal EO data.
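To make the wavelength-conditioned design more concrete, the sketch below shows one way a dynamic hypernetwork of this kind could be wired up in PyTorch. It is a minimal illustration of the general idea rather than DOFA's actual implementation: the class name, layer sizes, and the simple MLP hypernetwork are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WavelengthConditionedPatchEmbed(nn.Module):
    """Toy hypernetwork mapping each band's central wavelength to its own patch-embedding kernel."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.patch_size = patch_size
        self.embed_dim = embed_dim
        # Hypernetwork: central wavelength (one scalar per band) -> flattened conv kernel.
        self.hypernet = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim * patch_size * patch_size),
        )

    def forward(self, x: torch.Tensor, wavelengths: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image with C spectral bands; wavelengths: (C,) central wavelengths in micrometers.
        b, c, h, w = x.shape
        kernels = self.hypernet(wavelengths.view(c, 1))  # (C, D * P * P)
        kernels = kernels.view(c, self.embed_dim, self.patch_size, self.patch_size)
        tokens = torch.zeros(
            b, self.embed_dim, h // self.patch_size, w // self.patch_size, device=x.device
        )
        for i in range(c):
            # Each band is embedded with its dynamically generated kernel, then summed,
            # so the same backbone accepts any number of bands at any wavelengths.
            tokens = tokens + F.conv2d(
                x[:, i : i + 1], kernels[i].unsqueeze(1), stride=self.patch_size
            )
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, D) for a standard ViT encoder


# The same module can embed a 13-band Sentinel-2 patch or a 4-band NAIP patch.
embed = WavelengthConditionedPatchEmbed()
s2 = torch.randn(2, 13, 224, 224)
s2_wavelengths = torch.tensor([0.443, 0.490, 0.560, 0.665, 0.705, 0.740, 0.783,
                               0.842, 0.865, 0.945, 1.375, 1.610, 2.190])
print(embed(s2, s2_wavelengths).shape)  # torch.Size([2, 196, 768])
```

The point of the sketch is only that per-band embedding weights can be generated on the fly from wavelength metadata, which is what lets one shared Transformer backbone ingest imagery from sensors with very different band sets.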
Earth observation through satellite remote sensing enables deeper modeling and understanding of the Earth system. This pursuit is supported by the increasing deployment of satellites and sensors, each designed to capture distinct aspects of the Earth's surface at varied spatial, spectral, and temporal resolutions. The advancement in observational technologies has unleashed a deluge of data surpassing hundreds of petabytes across the atmosphere, ocean, land, and cryosphere, offering unprecedented insights into various physical and biological processes. The data from such diverse missions as Landsat, Sentinels, MODIS, EnMAP, Gaofen, and NAIP presents a rich yet complex mosaic of the Earth's surface. Interpreting the multifaceted EO data through artificial intelligence can unlock remarkable possibilities for understanding complex environmental processes, from climate monitoring to disaster response and sustainable development.
Traditional deep learning models use annotated datasets from these diverse data sources to train task-specific models. However, this paradigm demands substantial human effort for dataset collection and annotation, alongside significant computational resources for model training and evaluation. In response to these challenges, foundation models (FMs) have gained traction. Notable examples include large language models such as LLaMA, GPT-3, and ChatGPT, as well as prominent visual models such as CLIP, BLIP, and SAM. The essential advantage of such models is that they can be adapted to specific downstream tasks with relatively few annotated samples, benefiting from the general feature representations learned from massive amounts of unlabelled data.
One of the key challenges in developing EO foundation models is coping with multisensor data. Earlier methods were typically designed to specialize in a single data source or a specific range of spatial and spectral resolutions. For example, existing pretrained models such as GFM, Scale-MAE, and Cross-scale-MAE are pretrained on optical data. FG-MAE and SatMAE are developed for multi-spectral Sentinel-2 data, while SSL4EO-L is designed for Landsat imagery. CROMA uses two unimodal encoders for multi-spectral and synthetic aperture radar (SAR) data, together with a cross-modal radar-optical transformer that learns unified deep representations. DeCUR is a bimodal self-supervised model that decouples the unique and common representations between two different modalities.
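For intuition on what such bimodal designs can look like in practice, here is a deliberately simplified sketch, not the published CROMA or DeCUR code: two placeholder encoders produce embeddings whose first dimensions are treated as the "common" sub-space and aligned across modalities, while the remaining dimensions stay modality-specific. The encoder architectures, dimensions, and cosine-based alignment loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BimodalDecoupledEncoder(nn.Module):
    """Toy two-branch encoder whose embeddings are split into common and unique parts."""

    def __init__(self, dim: int = 256, common_dim: int = 128):
        super().__init__()
        self.common_dim = common_dim
        # Placeholder backbones; real methods use deep CNN or ViT encoders per modality.
        self.optical_enc = nn.Sequential(nn.Conv2d(13, dim, 8, 8), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.sar_enc = nn.Sequential(nn.Conv2d(2, dim, 8, 8), nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, optical: torch.Tensor, sar: torch.Tensor):
        z_opt, z_sar = self.optical_enc(optical), self.sar_enc(sar)
        # Align only the first `common_dim` dimensions across modalities; the remaining
        # dimensions are left free to encode modality-specific (unique) information.
        common_loss = 1.0 - F.cosine_similarity(
            z_opt[:, : self.common_dim], z_sar[:, : self.common_dim]
        ).mean()
        return z_opt, z_sar, common_loss


model = BimodalDecoupledEncoder()
optical = torch.randn(4, 13, 64, 64)   # multi-spectral patch (Sentinel-2-like)
sar = torch.randn(4, 2, 64, 64)        # SAR patch (Sentinel-1-like, VV/VH)
_, _, loss = model(optical, sar)
print(loss.item())
```

In contrast to such two-encoder designs, DOFA pursues a single shared backbone whose input embedding is conditioned on wavelength, as described above.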