This paper evaluates the performance of instruction-following Vision-Language Models (VLMs), particularly GPT-4V, on Earth observation (EO) tasks such as scene understanding, localization and counting, and change detection. The study aims to assess how well these models transfer to EO data, which consists predominantly of satellite and aerial imagery. To this end, it introduces a comprehensive benchmark covering these tasks across application areas such as urban monitoring, disaster relief, land use, and conservation.
The study finds that while VLMs like GPT-4V excel in tasks requiring open-ended reasoning and image captioning, they struggle with spatial reasoning tasks such as object localization and counting. GPT-4V performs well in scene understanding tasks, such as recognizing landmarks and generating captions, but its performance on counting tasks is poor. It also fails to accurately detect changes in building damage after disasters.
The benchmark includes datasets for evaluating scene understanding, localization and counting, and change detection. In scene understanding, GPT-4V achieves high accuracy in landmark recognition and image captioning but struggles with land cover classification owing to label ambiguity. In localization and counting, it performs poorly, counting small objects inaccurately and confusing visually similar classes. In change detection, it fails to categorize damaged buildings accurately.
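The paper's exact evaluation protocol is not reproduced here, but a counting sub-task of this kind is typically scored by prompting the model once per image and comparing its numeric answer against ground truth. The sketch below shows one way such a loop might look; the function names (`counting_accuracy`, `parse_count`, `ask_vlm`), the prompt wording, and the exact-match metric are illustrative assumptions, not taken from the paper.

```python
import re
from typing import Callable

def parse_count(answer: str) -> int | None:
    """Extract the first integer from a free-form model answer, if any."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def counting_accuracy(
    samples: list[tuple[str, int]],
    ask_vlm: Callable[[str, str], str],
) -> float:
    """Exact-match accuracy over (image_path, ground_truth_count) pairs.

    `ask_vlm(image_path, prompt)` is a hypothetical stand-in for whatever
    vision-chat API the benchmarked model (e.g. GPT-4V) exposes; it should
    return the model's free-form text answer.
    """
    prompt = ("How many small vehicles are visible in this image? "
              "Answer with a single number.")
    correct = 0
    for image_path, true_count in samples:
        predicted = parse_count(ask_vlm(image_path, prompt))
        if predicted == true_count:
            correct += 1
    return correct / len(samples) if samples else 0.0
```

In practice, benchmarks of this type often relax exact match to a tolerance band (e.g. counting a prediction as correct if it falls within a small percentage of the true count), since free-form answers from VLMs rarely hit large counts exactly.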
The study highlights the limitations of current VLMs in handling EO data, particularly in tasks requiring precise spatial reasoning and object counting. It suggests that further research is needed to improve the spatial awareness and change detection capabilities of VLMs. The benchmark is made publicly available for model evaluation.