15 Feb 2024 | Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
This paper proposes a method for aligning remote sensing (RS) imagery modalities with the visual and textual modalities of CLIP, a contrastive language-image pre-training model, to improve CLIP's zero-shot performance on RS tasks. The method consists of two stages: first, robust fine-tuning of CLIP to address the distribution shift between natural images and RS imagery; second, cross-modal alignment of a pre-trained RS encoder with CLIP's visual and textual modalities. The approach is evaluated on RS image classification and cross-modal retrieval, showing significant gains across several RS benchmark datasets without relying on textual descriptions or task-specific parameters, and without catastrophic forgetting. The results show that the method outperforms existing CLIP-based models on RS tasks, particularly in cross-modal retrieval and zero-shot classification. Built on OpenAI's CLIP, the method leverages CLIP's large-scale pre-training to align additional RS imagery modalities with its visual and textual embedding spaces. The paper also reviews related work, including domain-specific CLIP models and multi-modal CLIP-inspired models in remote sensing, and presents an ablation study of the proposed method. By aligning RS imagery modalities with CLIP's visual and textual modalities, the approach enables a rich set of cross-modal retrieval and text-based zero-shot downstream tasks, and the authors conclude that it provides a blueprint for developing resource-efficient RS vision-language models.
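To make the two-stage idea concrete, the sketch below is an illustrative assumption, not the authors' implementation: a frozen CLIP-like image encoder stands in for the robustly fine-tuned CLIP of stage one, and stage two trains a projection on top of a pre-trained RS encoder so that its embeddings of, e.g., multispectral inputs land next to the CLIP embeddings of paired RGB views. All module names, shapes, and the 13-band input are hypothetical placeholders.

```python
# Hedged sketch of stage-two cross-modal alignment (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed width of the CLIP embedding space


class FrozenCLIPImageEncoder(nn.Module):
    """Stand-in for the (robustly fine-tuned) CLIP image tower; kept frozen."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, EMBED_DIM),
        )
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, rgb):
        return F.normalize(self.backbone(rgb), dim=-1)


class RSEncoderWithProjection(nn.Module):
    """Pre-trained RS encoder (e.g., multispectral) plus a trainable projection
    into the CLIP embedding space."""

    def __init__(self, in_channels=13):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1024),
        )
        self.proj = nn.Linear(1024, EMBED_DIM)

    def forward(self, ms):
        return F.normalize(self.proj(self.backbone(ms)), dim=-1)


def alignment_loss(rs_emb, clip_emb):
    """Cosine loss pulling each RS embedding toward its paired RGB CLIP embedding."""
    return (1.0 - (rs_emb * clip_emb).sum(dim=-1)).mean()


# One illustrative training step on a paired (multispectral, RGB) batch.
clip_image = FrozenCLIPImageEncoder().eval()
rs_model = RSEncoderWithProjection()
opt = torch.optim.AdamW(rs_model.parameters(), lr=1e-4)

ms_batch = torch.randn(8, 13, 224, 224)   # Sentinel-2-like multispectral input (assumed)
rgb_batch = torch.randn(8, 3, 224, 224)   # co-registered RGB counterpart (assumed)

with torch.no_grad():
    target = clip_image(rgb_batch)         # frozen CLIP embeddings as alignment targets
loss = alignment_loss(rs_model(ms_batch), target)
opt.zero_grad()
loss.backward()
opt.step()
```

Once aligned in this way, the RS encoder's embeddings live in CLIP's shared space, so CLIP's frozen text encoder can be reused directly for text-based zero-shot classification and cross-modal retrieval without adding task-specific parameters.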