15 Feb 2024 | Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
This paper addresses the challenges of applying Contrastive Language-Image Pre-training (CLIP) to Remote Sensing (RS) imagery, which often exhibits distributions that differ from natural images and relies on complementary modalities beyond RGB. The authors propose a two-stage methodology to align distinct RS imagery modalities with the visual and textual modalities of CLIP. The first stage robustly fine-tunes CLIP on RGB composites of RS data to handle the distribution shift. The second stage aligns a pre-trained RS encoder with the visual and textual modalities of CLIP through cross-modal alignment. The method is evaluated on RS imagery classification and cross-modal retrieval tasks, demonstrating significant performance gains across several benchmark datasets without relying on textual descriptions, introducing task-specific parameters, or training from scratch. The contributions include a novel methodology for cross-modal alignment, extensive benchmarking, and an improved CLIP model with stronger zero-shot performance on RS imagery tasks.
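To make the second stage concrete, here is a minimal PyTorch sketch of cross-modal alignment: a trainable RS-modality encoder is pulled toward the frozen, fine-tuned CLIP vision encoder over paired scenes (a non-RGB input and its RGB composite). The names `rs_encoder` and `clip_image_encoder` and the symmetric InfoNCE objective are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(rs_emb, clip_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss pulling each RS embedding
    toward the CLIP embedding of the matching scene's RGB composite.
    Illustrative only; the paper's exact objective may differ."""
    rs_emb = F.normalize(rs_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = rs_emb @ clip_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def train_step(rs_encoder, clip_image_encoder, multispectral, rgb_composite, optimizer):
    # Hypothetical stage-2 step: CLIP stays frozen, only the RS encoder learns.
    with torch.no_grad():
        clip_emb = clip_image_encoder(rgb_composite)
    rs_emb = rs_encoder(multispectral)  # embed the non-RGB modality
    loss = alignment_loss(rs_emb, clip_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the RS encoder is mapped into CLIP's shared embedding space, the aligned model inherits CLIP's text encoder for zero-shot classification and retrieval on the new modality without textual descriptions or task-specific heads, consistent with the claims above.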