SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

18 Jan 2024 | Yang Zhan, Zhitong Xiong, Yuan Yuan
SkyEyeGPT is a unified multi-modal large language model (MLLM) designed specifically for remote sensing (RS) vision-language understanding. To address the lack of instruction data in the RS domain, the authors curate a high-quality RS multi-modal instruction-following dataset, SkyEye-968k, consisting of 968k samples covering both single-task and multi-task conversation instructions. SkyEyeGPT's architecture consists of a visual encoder, an alignment layer, and an LLM-based decoder. The visual encoder extracts RS visual features, which the alignment layer projects into the language domain; the projected features are then fed into the LLM-based decoder together with task-specific instructions to predict answers for open-ended RS tasks. A two-stage tuning method is designed to strengthen instruction-following and multi-turn dialogue abilities. Experiments on eight RS vision-language datasets demonstrate SkyEyeGPT's superior performance on image-level and region-level tasks, such as captioning and visual grounding. The model also shows encouraging results compared with GPT-4V in several qualitative tests. The online demo, code, and dataset are publicly available.
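To make the described pipeline concrete, below is a minimal PyTorch sketch of an encoder / alignment-layer / LLM-decoder architecture of this kind. It is not the authors' released implementation: the module choices (a toy convolutional "visual encoder", a single linear alignment layer, a small Transformer decoder standing in for the LLM), the dimensions, and the class name are assumptions made purely for illustration.

```python
# Minimal sketch of a visual-encoder -> alignment-layer -> LLM-decoder pipeline.
# All modules, names, and dimensions are illustrative assumptions, not the
# released SkyEyeGPT code (which uses a pretrained visual encoder and LLM).
import torch
import torch.nn as nn


class RSVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a (normally frozen, pretrained) RS visual encoder.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, vision_dim, kernel_size=14, stride=14),
            nn.Flatten(2),  # (B, vision_dim, num_patches)
        )
        # Alignment layer: projects visual tokens into the LLM embedding space.
        self.alignment = nn.Linear(vision_dim, llm_dim)
        # Stand-in for an LLM decoder; a real system would plug in a pretrained LLM
        # (e.g. llm_dim = 4096 for a LLaMA-scale model).
        self.llm_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embedding = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, instruction_ids):
        # Encode the RS image into a sequence of visual tokens.
        vis = self.visual_encoder(image).transpose(1, 2)  # (B, num_patches, vision_dim)
        vis = self.alignment(vis)                         # (B, num_patches, llm_dim)
        # Embed the task-specific instruction tokens.
        txt = self.text_embedding(instruction_ids)        # (B, T, llm_dim)
        # Condition the decoder on the projected visual tokens.
        hidden = self.llm_decoder(tgt=txt, memory=vis)
        return self.lm_head(hidden)                       # next-token logits


if __name__ == "__main__":
    model = RSVisionLanguageModel()
    image = torch.randn(1, 3, 224, 224)
    instruction_ids = torch.randint(0, 32000, (1, 16))
    logits = model(image, instruction_ids)
    print(logits.shape)  # torch.Size([1, 16, 32000])
```

In this sketch, generation for open-ended RS tasks (captioning, grounding, VQA) would proceed by decoding tokens autoregressively from the logits, with the instruction text controlling which task the model performs.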