Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

14 Jun 2024 | Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang
This paper introduces Med-HallMark, the first benchmark for hallucination detection and evaluation in the medical multimodal domain. It also proposes the MediHall Score, a new evaluation metric for assessing hallucinations in Large Vision Language Models (LVLMs), and MediHallDetector, a novel medical LVLM designed for precise hallucination detection. Med-HallMark provides multi-task hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. The MediHall Score uses a hierarchical scoring system that accounts for the severity and type of each hallucination, enabling a granular assessment of potential clinical impact. MediHallDetector is trained with multitask learning to detect hallucinations in model output text at fine granularity. The authors report that the MediHall Score gives a more nuanced picture of hallucination impact than traditional metrics and that MediHallDetector achieves enhanced detection performance. The paper also discusses the challenges of hallucination detection for LVLMs in healthcare applications and proposes solutions along three dimensions: data, evaluation metrics, and detection methods. According to the authors, the proposed methods significantly improve the reliability of LVLMs in medical applications.
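To make the idea of a hierarchical, severity-aware score concrete, the following is a minimal sketch and not the paper's implementation. The tier names (`severe`, `moderate`, `minor`, `correct`), their weights, and both function names are hypothetical placeholders chosen for illustration; the summary above does not list the actual categories or values used by the MediHall Score. The sketch contrasts a severity-weighted average with a flat metric that only checks whether each sentence is hallucinated.

```python
from statistics import mean
from typing import Sequence

# Hypothetical severity tiers and weights, for illustration only.
# The summary does not give the paper's actual categories or values.
ILLUSTRATIVE_WEIGHTS = {
    "severe": 0.0,    # assumed: hallucination with the highest clinical risk
    "moderate": 0.5,  # assumed: wrong detail with limited clinical impact
    "minor": 0.8,     # assumed: harmless imprecision
    "correct": 1.0,   # no hallucination
}


def hierarchical_score(labels: Sequence[str]) -> float:
    """Average the severity weight of each labeled sentence in a response."""
    if not labels:
        raise ValueError("at least one sentence label is required")
    return mean(ILLUSTRATIVE_WEIGHTS[label] for label in labels)


def flat_score(labels: Sequence[str]) -> float:
    """Flat baseline: fraction of sentences labeled correct, ignoring severity."""
    if not labels:
        raise ValueError("at least one sentence label is required")
    return mean(1.0 if label == "correct" else 0.0 for label in labels)


# Two responses with one hallucinated sentence each.
response_a = ["correct", "minor"]
response_b = ["correct", "severe"]

# The flat metric cannot tell the two apart; the weighted one can.
print(flat_score(response_a), flat_score(response_b))
print(hierarchical_score(response_a), hierarchical_score(response_b))
```

Under these assumed weights, both responses get the same flat score, while the weighted score ranks the response with the severe hallucination lower. This is the kind of distinction a severity-aware metric is meant to capture.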