CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

10 Jun 2024 | Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, Huaxiu Yao
The paper introduces CARES, a comprehensive benchmark designed to evaluate the trustworthiness of Medical Large Vision Language Models (Med-LVLMs) across five dimensions: trustfulness, fairness, safety, privacy, and robustness. CARES comprises 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. The evaluation reveals that existing Med-LVLMs exhibit significant issues in trustworthiness, including factual inaccuracies, lack of fairness across demographic groups, vulnerability to attacks, and poor privacy awareness. The paper also discusses the construction of the benchmark, the evaluation setup, and detailed results for each dimension. The findings highlight the need for further standardization and the development of more reliable Med-LVLMs to ensure trustworthy and effective healthcare applications.