10 Jun 2024 | Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, Huaxiu Yao
CARES is a comprehensive benchmark for evaluating the trustworthiness of Medical Large Vision Language Models (Med-LVLMs) across five dimensions: trustfulness, fairness, safety, privacy, and robustness. It comprises about 41,000 question-answer pairs covering 16 medical image modalities and 27 anatomical regions, and it evaluates four open-source Med-LVLMs alongside two advanced general-purpose LVLMs.

The study finds that Med-LVLMs often produce factual inaccuracies, perform unevenly across demographic groups, are vulnerable to attacks, and fail to protect patient privacy. The models also show poor uncertainty estimation and overconfidence, which can lead to misdiagnoses. While LLaVA-Med performs best on factuality, it is excessively cautious and shows low accuracy in some cases. The benchmark further reveals significant performance disparities across age, gender, and racial groups. Med-LVLMs are susceptible to jailbreaking, overcautiousness, and toxicity, with LLaVA-Med showing the strongest resistance to toxic outputs; the models also frequently disclose private information rather than protecting it. In terms of robustness, Med-LVLMs fail to handle out-of-distribution data effectively. These results underscore the need for comprehensive trustworthiness evaluation of Med-LVLMs to ensure their reliability and safety in medical applications and to prevent potential harm to patients.
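The fairness and uncertainty findings reduce to two simple measurements: accuracy broken down by demographic subgroup, and the gap between a model's stated confidence and its actual accuracy. The sketch below illustrates both, assuming each model answer is stored as a record with prediction, ground truth, a confidence score, and demographic attributes; the record format and function names are hypothetical illustrations, not the CARES evaluation code.

```python
# Minimal sketch (not the CARES implementation): subgroup accuracy gaps and
# overconfidence via expected calibration error (ECE). Field names are assumptions.
from collections import defaultdict

def subgroup_accuracy(records, group_key):
    """Accuracy per demographic subgroup, e.g. group_key='age' or 'race'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        hits[g] += int(r["prediction"] == r["answer"])
    return {g: hits[g] / totals[g] for g in totals}

def expected_calibration_error(records, n_bins=10):
    """ECE: confidence-vs-accuracy gap, averaged over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece, n = 0.0, len(records)
    for b in bins:
        if not b:
            continue
        acc = sum(int(r["prediction"] == r["answer"]) for r in b) / len(b)
        conf = sum(r["confidence"] for r in b) / len(b)
        ece += (len(b) / n) * abs(conf - acc)
    return ece

# Toy usage: a model that is confident but not always right.
records = [
    {"prediction": "yes", "answer": "yes", "confidence": 0.9, "age": "40-60"},
    {"prediction": "no",  "answer": "yes", "confidence": 0.8, "age": "60+"},
]
print(subgroup_accuracy(records, "age"))       # per-group accuracy
print(expected_calibration_error(records))     # higher = more miscalibrated
```

A large gap between subgroup accuracies signals the fairness disparities reported above, and a high ECE reflects the overconfidence the study flags as a misdiagnosis risk.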