VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

The paper introduces H²RSVLM, a helpful and honest large vision language model (VLM) for remote sensing (RS), designed to address the limitations of existing VLMs in this domain. Current VLMs struggle with RS imagery because of the distinctive characteristics of RS images and the models' limited spatial perception.

To improve performance, the authors created HqDC-1.4M, a large-scale dataset of 1.4 million image-caption pairs with high-quality, detailed descriptions. This dataset enhances the understanding and spatial perception of RSVLMs, improving their ability to recognize and count objects. The authors also developed RSSA, which they describe as the first dataset aimed at enhancing the self-awareness of RSVLMs. It includes both answerable and unanswerable questions, with the goal of improving honesty and reduce hallucinations.

Building on these datasets, the authors propose H²RSVLM, which they report achieves outstanding performance on multiple RS datasets and can recognize and refuse to answer unanswerable questions, mitigating incorrect outputs. The paper also reviews related work, including other RS VLMs and large-scale RS vision-language datasets, and presents quantitative and qualitative results. According to these results, H²RSVLM outperforms other models in scene classification, visual question answering, and visual grounding, and demonstrates strong self-awareness. The authors conclude that H²RSVLM is a significant advance in RS VLMs, offering improved helpfulness and honesty in remote sensing applications. The code, data, and model weights are available at https://github.com/opendatalab/H2RSVLM.

H²RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model

29 Mar 2024 | Chao Pang, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Xingxing Weng, Shuai Wang, Litong Feng, Gui-Song Xia, Conghui He