LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

16 Jul 2024 | Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, Pengfeng Xiao
The paper introduces LHRS-Bot, a multimodal large language model (MLLM) designed specifically for remote sensing (RS) image understanding. To address the challenges of RS image understanding, the authors construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, by leveraging volunteered geographic information (VGI) and globally available RS images. LHRS-Align contains 1.15 million meaningful, high-quality RS image-text pairs, while LHRS-Instruct includes complex visual reasoning data generated by GPT-4.

LHRS-Bot employs a novel multi-level vision-language alignment strategy and a curriculum learning method to effectively summarize multi-level visual representations and enhance its understanding of RS images. Comprehensive experiments demonstrate that LHRS-Bot outperforms existing MLLMs on various RS image understanding tasks, including image classification, visual question answering (VQA), and visual grounding.

The authors also introduce LHRS-Bench, a benchmark for evaluating MLLMs' abilities in RS image understanding, comprising 690 single-choice questions that cover five top-level evaluation dimensions and 11 fine-grained categories. The main contributions of the work are the creation of LHRS-Align and LHRS-Instruct, the development of LHRS-Bot, and the establishment of LHRS-Bench.
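To make the multi-level alignment idea concrete, it can be pictured as a small bridge module in which learnable query tokens pool visual features from several encoder levels via cross-attention before being projected into the language model's embedding space. The sketch below is a hypothetical PyTorch illustration of that idea only; the module name MultiLevelVisionBridge, the dimensions, the per-level query counts, and the pooling scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-level vision-language bridge: learnable query
# tokens summarize visual features from several encoder levels, and the pooled
# summaries are projected into the language model's token embedding space.
# All names, sizes, and the pooling scheme are assumptions for illustration.
import torch
import torch.nn as nn


class MultiLevelVisionBridge(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, num_levels=3,
                 queries_per_level=8, num_heads=8):
        super().__init__()
        # One small set of learnable queries per visual feature level.
        self.queries = nn.ParameterList([
            nn.Parameter(torch.randn(queries_per_level, vis_dim) * 0.02)
            for _ in range(num_levels)
        ])
        # Cross-attention pooling: queries attend over the patch features.
        self.pool = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into LLM embedding space

    def forward(self, level_feats):
        # level_feats: list of [batch, num_patches, vis_dim] tensors,
        # one entry per encoder level (e.g. shallow, middle, deep).
        pooled = []
        for q, feats in zip(self.queries, level_feats):
            q = q.unsqueeze(0).expand(feats.size(0), -1, -1)
            summary, _ = self.pool(q, feats, feats)  # pooled per-level summary
            pooled.append(summary)
        # Concatenate per-level summaries into one compact visual token sequence.
        return self.proj(torch.cat(pooled, dim=1))


# Toy usage: three feature levels from a ViT-style encoder.
if __name__ == "__main__":
    bridge = MultiLevelVisionBridge()
    feats = [torch.randn(2, 196, 1024) for _ in range(3)]
    visual_tokens = bridge(feats)
    print(visual_tokens.shape)  # torch.Size([2, 24, 4096])
```

In a typical MLLM setup, the resulting visual tokens would be concatenated with the embedded instruction text and fed to the language model; whether LHRS-Bot wires the levels together in exactly this way is not specified here, so this should be read only as a sketch of the general pattern.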