This paper evaluates the performance of instruction-following Vision-Language Models (VLMs), particularly GPT-4V, on Earth observation (EO) tasks such as scene understanding, localization and counting, and change detection. The study aims to assess how well these models transfer to EO data, which consists predominantly of satellite and aerial imagery. To this end, it introduces a comprehensive benchmark covering these tasks across application areas such as urban monitoring, disaster relief, land use, and conservation.
The study finds that while VLMs like GPT-4V excel in tasks requiring open-ended reasoning and image captioning, they struggle with spatial reasoning tasks such as object localization and counting. GPT-4V performs well in scene understanding tasks, such as recognizing landmarks and generating captions, but its performance on counting tasks is poor. It also fails to accurately detect changes in building damage after disasters.
The benchmark includes datasets for evaluating scene understanding, localization and counting, and change detection. In scene understanding, GPT-4V achieves high accuracy in landmark recognition and image captioning but struggles with land cover classification owing to label ambiguity. In localization and counting, it performs poorly, counting small objects inaccurately and confusing visually similar classes. In change detection, it fails to categorize damaged buildings accurately.
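The paper's exact evaluation protocol is not reproduced here, but a counting sub-task of this kind is typically scored by prompting the model once per image and comparing its numeric answer against ground truth. The sketch below shows one way such a loop might look; the function names (`counting_accuracy`, `parse_count`, `ask_vlm`), the prompt wording, and the exact-match metric are illustrative assumptions, not taken from the paper.

```python
import re
from typing import Callable

def parse_count(answer: str) -> int | None:
    """Extract the first integer from a free-form model answer, if any."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def counting_accuracy(
    samples: list[tuple[str, int]],
    ask_vlm: Callable[[str, str], str],
) -> float:
    """Exact-match accuracy over (image_path, ground_truth_count) pairs.

    `ask_vlm(image_path, prompt)` is a hypothetical stand-in for whatever
    vision-chat API the benchmarked model (e.g. GPT-4V) exposes; it should
    return the model's free-form text answer.
    """
    prompt = ("How many small vehicles are visible in this image? "
              "Answer with a single number.")
    correct = 0
    for image_path, true_count in samples:
        predicted = parse_count(ask_vlm(image_path, prompt))
        if predicted == true_count:
            correct += 1
    return correct / len(samples) if samples else 0.0
```

In practice, benchmarks of this type often relax exact match to a tolerance band (e.g. counting a prediction as correct if it falls within a small percentage of the true count), since free-form answers from VLMs rarely hit large counts exactly.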
The study highlights the limitations of current VLMs in handling EO data, particularly in tasks requiring precise spatial reasoning and object counting. It suggests that further research is needed to improve the spatial awareness and change detection capabilities of VLMs. The benchmark is made publicly available for model evaluation.