Large Language Models for Captioning and Retrieving Remote Sensing Images

9 Feb 2024 | João Daniel Silva, João Magalhães, Devis Tuia, Senior Member, IEEE, Bruno Martins, Senior Member, IEEE
This paper introduces RS-CapRet, a Vision and Language model designed for remote sensing tasks, particularly image captioning and text-image retrieval. The model leverages a large decoder language model and image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the gap between the image encoder and language decoder, simple linear layers are trained using examples from remote sensing image captioning datasets, with all other parameters frozen. RS-CapRet can generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving state-of-the-art or competitive performance. Qualitative results demonstrate the model's ability to describe remote sensing images, retrieve images based on different queries, and process interleaved sequences of images and text in a dialogue manner. The paper also discusses related work, including image captioning and cross-modal retrieval methods, and provides experimental details and results.
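The bridging idea described above — a frozen image encoder and a frozen language decoder connected only by trained linear layers — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions (768-d vision features, 4096-d language-model embeddings) and the number of visual tokens per image are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps frozen image-encoder features into the language model's
    embedding space as a short sequence of pseudo-token embeddings.
    Only this projection is trained; encoder and decoder stay frozen."""

    def __init__(self, vision_dim=768, lm_dim=4096, n_visual_tokens=4):
        super().__init__()
        self.n_visual_tokens = n_visual_tokens
        self.lm_dim = lm_dim
        # The simple linear layer that bridges the two frozen models.
        self.proj = nn.Linear(vision_dim, lm_dim * n_visual_tokens)

    def forward(self, image_features):
        # image_features: (batch, vision_dim) from the frozen encoder.
        out = self.proj(image_features)
        # Reshape into (batch, n_visual_tokens, lm_dim) so the outputs
        # can be prepended to the decoder's text-token embeddings.
        return out.view(-1, self.n_visual_tokens, self.lm_dim)

# Example: project a batch of 2 image feature vectors.
feats = torch.randn(2, 768)
bridge = VisualProjection()
tokens = bridge(feats)

# Freezing the rest of the pipeline would look like:
#   for p in image_encoder.parameters(): p.requires_grad = False
#   for p in language_model.parameters(): p.requires_grad = False
```

The resulting `(batch, n_visual_tokens, lm_dim)` tensor can be concatenated with ordinary token embeddings, which is what lets the decoder process interleaved sequences of images and text.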