18 Jun 2024 | Xiang Li, Jian Ding, Mohamed Elhoseiny
The paper introduces VRSBench, a new benchmark dataset designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Existing datasets in remote sensing often lack detailed object information, suffer from inadequate quality control, or are tailored to single tasks. VRSBench addresses these limitations by providing 29,614 images with detailed captions, 52,472 object references, and 123,221 question-answer pairs. The dataset facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks, including image captioning, visual grounding, and visual question answering. The paper evaluates state-of-the-art models on this benchmark and highlights the need for more advanced vision-language models to handle the complexities of remote sensing images. The data and code are available at https://vrsbench.github.io.
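To make the three tasks concrete, below is a minimal sketch of iterating over per-image annotations that bundle a caption, object references, and question-answer pairs. The file name and every field name here (`image_id`, `caption`, `objects`, `expression`, `bbox`, `qa_pairs`) are assumptions for illustration only; the actual schema is documented with the released data at https://vrsbench.github.io.

```python
import json

# Hypothetical annotation layout; the real VRSBench schema may differ.
with open("vrsbench_annotations.json") as f:   # assumed file name
    annotations = json.load(f)

# Inspect the first few images: each assumed entry carries a caption
# (captioning), referred objects (visual grounding), and QA pairs (VQA).
for entry in annotations[:3]:
    image_id = entry.get("image_id")           # assumed field
    caption = entry.get("caption")             # detailed image caption
    objects = entry.get("objects", [])         # object references for grounding
    qa_pairs = entry.get("qa_pairs", [])       # question-answer pairs for VQA

    print(f"Image {image_id}: {caption}")
    for obj in objects:
        # Each reference is assumed to pair a textual expression with a bounding box.
        print("  ref:", obj.get("expression"), "box:", obj.get("bbox"))
    for qa in qa_pairs:
        print("  Q:", qa.get("question"), "A:", qa.get("answer"))
```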