LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

16 Jul 2024 | Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao
LHRS-Bot is a large multimodal language model (MLLM) tailored to remote sensing (RS) image understanding, built by pairing globally available volunteered geographic information (VGI) with RS imagery. To address the limitations of existing MLLMs in this domain, the authors construct two datasets: LHRS-Align, a large-scale RS image-text dataset, and LHRS-Instruct, a multimodal instruction-following dataset. LHRS-Bot is trained on these datasets with a novel multi-level vision-language alignment strategy and a curriculum learning method, achieving state-of-the-art performance on RS image understanding tasks.

The authors also introduce LHRS-Bench, a benchmark for evaluating MLLMs on RS imagery, comprising 690 single-choice questions across five major evaluation dimensions and 11 sub-dimensions. Comprehensive experiments show that LHRS-Bot exhibits a profound understanding of RS images and can perform nuanced reasoning within the RS domain: it detects intricate objects, engages in natural conversation, and draws insights from the visual content of RS images. LHRS-Bot outperforms existing MLLMs on a range of RS image understanding tasks, including classification, visual question answering (VQA), and visual grounding, and also performs strongly on LHRS-Bench.

The main contributions of this work are the LHRS-Align and LHRS-Instruct datasets, the LHRS-Bot model, and the establishment of LHRS-Bench as a benchmark for evaluating MLLMs in the RS domain.
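The multi-level vision-language alignment is only named in this summary, not specified. As a rough illustration, the sketch below shows one common way such a bridge can be built: features from several depths of the vision encoder are summarized into a small set of visual tokens by learnable queries and then projected into the LLM's embedding space. The module names, dimensions, and per-level query counts here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiLevelBridge(nn.Module):
    """Hypothetical multi-level vision-to-LLM bridge (illustrative only)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, queries_per_level=(16, 8, 4)):
        super().__init__()
        # One learnable query set per feature level (counts are illustrative).
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(n, vis_dim) * 0.02) for n in queries_per_level]
        )
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, level_feats):
        # level_feats: list of [batch, num_patches, vis_dim] tensors, one per encoder level.
        tokens = []
        for feats, q in zip(level_feats, self.queries):
            q = q.unsqueeze(0).expand(feats.size(0), -1, -1)
            # Cross-attend the learnable queries to this level's patch features.
            summarized, _ = self.attn(q, feats, feats)
            tokens.append(summarized)
        # Concatenate the per-level summaries and map them into the LLM embedding space.
        return self.proj(torch.cat(tokens, dim=1))

# Example: three levels of ViT patch features -> 16 + 8 + 4 = 28 visual tokens.
feats = [torch.randn(2, 256, 1024) for _ in range(3)]
visual_tokens = MultiLevelBridge()(feats)
print(visual_tokens.shape)  # torch.Size([2, 28, 4096])
```

Giving deeper levels fewer queries is one plausible way to keep the visual token budget small while still exposing both fine-grained and high-level semantic features to the language model; the actual design in LHRS-Bot may differ.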