18 Jun 2024 | Xiang Li, Jian Ding, Mohamed Elhoseiny
VRSBench is a new benchmark dataset designed to advance the development of general-purpose, large-scale vision-language models for remote sensing image understanding. It contains 29,614 remote sensing images paired with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs, supporting the training and evaluation of vision-language models across a wide range of remote sensing image understanding tasks. The annotations were produced with a semi-automatic data collection pipeline combining object attribute extraction, prompt engineering, GPT-4 inference, and human verification, yielding large-scale human-verified annotations with rich object details and diverse question-answer pairs. On top of these annotations, VRSBench defines three benchmarks: detailed image captioning, visual grounding, and visual question answering. Models were evaluated on these three tasks, and the results show that VRSBench significantly contributes to the development of advanced vision-language models in remote sensing. The dataset is available at https://vrsbench.github.io.
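To make the three annotation types concrete, here is a minimal sketch of how a single image's annotations could be organized for captioning, visual grounding, and visual question answering. The field names and values are purely illustrative assumptions, not the dataset's actual schema; consult the project page for the real format.

```python
import json

# Hypothetical layout of one VRSBench image's annotations.
# Field names (image_id, caption, object_refs, qa_pairs, bbox) are assumptions
# for illustration only, not the official VRSBench schema.
record = {
    "image_id": "example_0001.png",
    "caption": "A detailed, human-verified description of the scene ...",
    "object_refs": [
        # visual grounding: a referring expression paired with a bounding box
        {"ref": "the large white airplane near the terminal",
         "bbox": [120, 48, 310, 190]},  # [x_min, y_min, x_max, y_max] in pixels
    ],
    "qa_pairs": [
        # visual question answering: question/answer pairs about the image
        {"question": "How many airplanes are visible?", "answer": "3"},
    ],
}

def summarize(rec: dict) -> str:
    """Return a one-line summary of an annotation record."""
    return (f"{rec['image_id']}: {len(rec['object_refs'])} object reference(s), "
            f"{len(rec['qa_pairs'])} QA pair(s)")

if __name__ == "__main__":
    print(summarize(record))
    print(json.dumps(record, indent=2))
```

A per-image record like this is one plausible way to keep the caption, referring expressions, and question-answer pairs aligned for the three benchmark tasks.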