Large Language Models for Captioning and Retrieving Remote Sensing Images

9 Feb 2024 | João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins
This paper presents RS-CapRet, a vision-and-language model for remote sensing tasks, specifically image captioning and text-image retrieval. The model combines a large decoder language model with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the image encoder and the language decoder, the authors train simple linear layers on examples combined from several remote sensing image captioning datasets, keeping all other parameters frozen. RS-CapRet can generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving state-of-the-art or competitive performance relative to existing methods. Qualitative results show that RS-CapRet effectively leverages the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and process interleaved sequences of images and text in a dialogue-like manner.

The paper discusses the challenges of applying vision-and-language models to the remote sensing domain, including the relatively small size of the available datasets and models. Previous methods used encoder-decoder architectures with CNNs as image encoders and LSTMs for text generation, while more recent work has adopted Transformer-based architectures and more capable vision encoders. RS-CapRet instead combines the capabilities of a large language model with an image encoder adapted to the remote sensing domain. Rather than fine-tuning the entire LLM and the vision encoder, the authors freeze their parameters and train only linear layers that project the visual embeddings into the input embedding space of the decoder model.
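As a rough illustration of this bridging step, the PyTorch-style sketch below projects pooled features from a frozen image encoder into the decoder's input embedding space and prepends them to the caption token embeddings. The dimensions, the number of visual tokens, and all names here are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualPrefixBridge(nn.Module):
    """Illustrative sketch: map frozen vision-encoder features into the input
    embedding space of a frozen decoder LLM (dimensions are assumptions)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_visual_tokens=4):
        super().__init__()
        # Only this linear map is trained; the vision encoder and LLM stay frozen.
        self.proj = nn.Linear(vis_dim, llm_dim * num_visual_tokens)
        self.num_visual_tokens = num_visual_tokens
        self.llm_dim = llm_dim

    def forward(self, image_features, caption_token_embeddings):
        # image_features: (batch, vis_dim), pooled output of the frozen image encoder
        # caption_token_embeddings: (batch, seq_len, llm_dim), from the frozen LLM embedding table
        visual_prefix = self.proj(image_features)
        visual_prefix = visual_prefix.view(-1, self.num_visual_tokens, self.llm_dim)
        # Prepend the projected visual tokens to the caption embeddings; the frozen
        # decoder is then trained (through this layer only) with standard
        # next-token prediction on the caption tokens.
        return torch.cat([visual_prefix, caption_token_embeddings], dim=1)
```

In such a setup only `proj` receives gradient updates, and the captioning loss is the usual next-token prediction over the caption tokens.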
RS-CapRet is trained with an image captioning objective together with a contrastive learning objective between the projected visual embeddings and the embedding of a special [RET] token. The [RET] token is added to the model's vocabulary, and linear layers project this token and the visual embeddings into a common feature space, so that RS-CapRet can retrieve the image whose embedding is most similar to the embedding of the [RET] token (a sketch of this retrieval mechanism is given at the end of this summary). The authors also motivate the choice of the vision encoder by experimenting with different models and measuring their performance on cross-modal retrieval tasks as a proxy for the quality of the image embeddings.

The paper evaluates RS-CapRet on several remote sensing image captioning datasets, including RSICD, UCM-Captions, Sydney-Captions, and NWPU-Captions. The results show that RS-CapRet achieves state-of-the-art or competitive performance in both image captioning and text-image retrieval. The model can also hold conversations about remote sensing images, describing input images and addressing follow-up user queries. The authors conclude that RS-CapRet is a promising approach for remote sensing tasks and suggest future work extending the model's capabilities to other vision-and-language tasks, including visual question answering and visual grounding.
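The sketch below shows how a [RET]-based contrastive objective and similarity-based retrieval could look in PyTorch. The symmetric InfoNCE formulation, the projection heads, and the temperature value are assumptions made for illustration; the paper only specifies a contrastive objective between the projected [RET] embedding and the projected visual embeddings.

```python
import torch
import torch.nn.functional as F

def ret_contrastive_loss(ret_hidden, image_features, ret_proj, img_proj, temperature=0.07):
    """Sketch of a symmetric InfoNCE-style loss between projected [RET] hidden
    states and projected image features (loss form and temperature are assumptions)."""
    # ret_hidden: (batch, llm_dim), hidden state of the [RET] token from the LLM
    # image_features: (batch, vis_dim), from the frozen image encoder
    t = F.normalize(ret_proj(ret_hidden), dim=-1)       # (batch, d)
    v = F.normalize(img_proj(image_features), dim=-1)   # (batch, d)
    logits = t @ v.t() / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching text/image pairs lie on the diagonal of the similarity matrix.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def retrieve_images(ret_hidden, gallery_embeddings, ret_proj, top_k=5):
    """Rank pre-computed, projected gallery image embeddings by cosine similarity
    to the projected [RET] embedding of the text query (illustrative sketch)."""
    q = F.normalize(ret_proj(ret_hidden), dim=-1)        # (1, d)
    g = F.normalize(gallery_embeddings, dim=-1)          # (num_images, d)
    scores = (q @ g.t()).squeeze(0)
    return scores.topk(top_k).indices
```

At inference time, retrieval then reduces to ranking a gallery of pre-computed, projected image embeddings by their similarity to the projected [RET] embedding of the query.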