The paper introduces FIT-RS, a large-scale instruction-tuning dataset for remote sensing vision-language understanding, together with an accompanying model. FIT-RS contains 1,800,851 instruction samples covering both basic and complex comprehension tasks, including relation reasoning, scene graph generation, and object reasoning. Based on FIT-RS, the authors build the FIT-RSFG and FIT-RSRC benchmarks to evaluate the fine-grained relation comprehension capabilities of large multi-modal models (LMMs). They propose SkySenseGPT, a comprehensive RSLMM that performs well on both public datasets and FIT-RSFG, surpassing existing RSLMMs. The dataset is available at https://github.com/Luo-Z13/SkySenseGPT.
The FIT-RS dataset is constructed from the STAR dataset, which provides high-resolution remote sensing images with detailed scene graph labels. It covers 48 important object categories and 58 high-value semantic relationships. The authors design a variety of tasks, including detailed image and region captioning, visual question answering, multi-label scene classification, and complex comprehension tasks such as relation detection, object detection, and scene graph generation. These tasks are designed to enhance fine-grained understanding of remote sensing scenes.
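To make the task design concrete, the sketch below shows a purely hypothetical layout for one instruction sample; the actual FIT-RS schema, field names, and relation vocabulary are not given in this summary, so every value here is an illustrative assumption.

```python
# Hypothetical instruction sample (field names and values are assumptions,
# not the real FIT-RS schema): an image is paired with a task tag, a
# natural-language instruction, and the expected response.
sample = {
    "image": "scene_000123.png",
    "task": "relation_detection",
    "instruction": ("What is the relationship between the airplane in "
                    "<region1> and the boarding bridge in <region2>?"),
    "response": "The airplane is parked alongside the boarding bridge.",
}

# Instruction tuning then trains the model to map (image, instruction)
# pairs to responses across all task types.
for key in ("image", "task", "instruction", "response"):
    print(key, "->", sample[key])
```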
The FIT-RSRC benchmark is designed to evaluate the relation comprehension ability of LMMs in remote sensing scenes. It consists of multiple-choice questions with high-quality distractor options, including unanswerable questions. The benchmark is scored with the CircularEval strategy to ensure fairness. The proposed SkySenseGPT achieves an overall accuracy of 55.5% on FIT-RSRC, surpassing existing LMMs and RSLMMs.
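CircularEval guards against position bias: a question counts as correct only if the model picks the right option under every circular rotation of the option order. A minimal sketch of that scoring rule (the `content_model` below is a made-up stand-in, not the authors' model):

```python
def circular_eval(question, options, answer_idx, model):
    """Return True only if `model` selects the correct option under
    every circular shift of the option order (CircularEval)."""
    n = len(options)
    for shift in range(n):
        shifted = options[shift:] + options[:shift]
        pred = model(question, shifted)  # model returns an index into `shifted`
        if shifted[pred] != options[answer_idx]:
            return False
    return True

# Hypothetical model that answers by content, not position:
def content_model(question, opts):
    return opts.index("bridge")

opts = ["road", "bridge", "dam", "unanswerable"]
print(circular_eval("What crosses the river?", opts, 1, content_model))
# A position-biased model (always picks option A) would fail:
print(circular_eval("What crosses the river?", opts, 1, lambda q, o: 0))
```

A model that merely favors a fixed option letter passes at most one rotation, so CircularEval rewards genuine relation understanding over answer-position heuristics.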
The SkySenseGPT model consists of a visual encoder, a multilayer perceptron serving as the multi-modal projector, and an LLM. It is trained on the FIT-RS dataset together with other public datasets, and evaluated on various benchmarks, including FIT-RSFG, FIT-RSRC, and public datasets. The results show that SkySenseGPT performs well on both basic and complex tasks, demonstrating strong basic comprehension capabilities and the ability to handle fine-grained tasks. The authors hope that FIT-RS can contribute to building more powerful RSLMMs.