29 Mar 2024 | Chao Pang1*, Jiang Wu2*, Jiayu Li1, Yi Liu1, Jiaxing Sun5, Weijia Li3, Xingxing Weng1, Shuai Wang4, Litong Feng4, Gui-Song Xia1,5,6,†, and Conghui He2,4†
The paper introduces H²RSVLM, a novel remote sensing vision language model that combines helpfulness and honesty. The authors address the limitations of existing remote sensing-specific Vision Language Models (RSVLMs) by constructing two large-scale datasets: HqDC-1.4M and RSSA. HqDC-1.4M contains 1.4 million image-caption pairs, enhancing the model's understanding and spatial perception abilities. RSSA is the first dataset designed to improve the self-awareness of RSVLMs, enabling them to recognize and refuse to answer unanswerable questions. The H²RSVLM model is trained using these datasets and demonstrates superior performance on various remote sensing tasks, including scene classification, visual question answering, and visual grounding. The model also shows strong honesty by avoiding hallucinations when faced with unanswerable questions. The paper includes detailed experimental results and ablation studies to validate the effectiveness of the proposed datasets and model.