7 Jun 2024 | Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, Xiao Xiang Zhu
The paper introduces the Dynamic One-For-All (DOFA) model, a multimodal foundation model designed to integrate and adapt diverse Earth observation (EO) data modalities within a single framework. Inspired by neural plasticity, DOFA leverages a dynamic hypernetwork that generates network weights conditioned on spectral wavelengths, enabling a single versatile Transformer to perform well across 14 distinct EO tasks, including data from sensors not seen during pretraining. This design gives DOFA greater adaptability and stronger performance in handling diverse EO data. The architecture combines a wavelength-conditioned dynamic patch embedding layer, a shared vision Transformer backbone, and a masked image modeling pretraining strategy with a distillation loss. Experimental results show that DOFA handles multimodal EO data effectively, outperforming existing foundation models on most downstream tasks. The paper also discusses the model's adaptability to additional data modalities and its potential applications in other domains such as medical image analysis, robotics, and climate modeling.
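To make the wavelength-conditioning idea more concrete, here is a minimal PyTorch sketch of a hypernetwork-style patch embedding. This is not the authors' implementation: the module name, dimensions, the small MLP hypernetwork, and the example wavelengths are illustrative assumptions. The point is that projection weights are generated per band from each band's central wavelength, so inputs with any number of spectral bands map into the same token space for a shared Transformer backbone.

```python
import torch
import torch.nn as nn

class WavelengthConditionedPatchEmbed(nn.Module):
    """Sketch of a wavelength-conditioned patch embedding:
    a small hypernetwork generates per-band projection weights
    from each band's central wavelength."""
    def __init__(self, patch_size=16, embed_dim=768, hidden_dim=128):
        super().__init__()
        self.patch_size = patch_size
        self.embed_dim = embed_dim
        # MLP that maps a scalar wavelength to the weights of a
        # per-band patch projection (patch_size*patch_size -> embed_dim).
        self.weight_gen = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, patch_size * patch_size * embed_dim),
        )
        self.bias_gen = nn.Linear(1, embed_dim)

    def forward(self, x, wavelengths):
        # x: (B, C, H, W) image with C spectral bands
        # wavelengths: (C,) central wavelength of each band (e.g. micrometers)
        B, C, H, W = x.shape
        p = self.patch_size
        # Cut into non-overlapping patches: (B, C, H/p, W/p, p, p) -> (B, C, N, p*p)
        patches = x.unfold(2, p, p).unfold(3, p, p).reshape(B, C, -1, p * p)
        wl = wavelengths.view(C, 1)
        w = self.weight_gen(wl).view(C, p * p, self.embed_dim)  # (C, p*p, D)
        b = self.bias_gen(wl)                                    # (C, D)
        # Project each band's patches and sum over bands -> (B, N, D) tokens
        return torch.einsum('bcnp,cpd->bnd', patches, w) + b.sum(0)

# Usage: a 4-band image (e.g. RGB + NIR) and a 13-band image yield tokens of
# the same shape, so one Transformer backbone can consume both.
embed = WavelengthConditionedPatchEmbed()
x_rgbn = torch.randn(2, 4, 224, 224)
tokens = embed(x_rgbn, torch.tensor([0.665, 0.560, 0.490, 0.842]))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

Because the band-specific weights are generated rather than stored as fixed per-sensor layers, the same embedding module can in principle handle sensors not seen during pretraining, as long as their band wavelengths are known.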