Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

14 Jun 2024 | Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang
This paper addresses the challenges of hallucination detection and evaluation in large vision language models (LVLMs) applied to healthcare, particularly in medical visual question answering and imaging report generation. The authors introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation in the medical multimodal domain. The benchmark supports hallucination detection across multiple tasks, provides multifaceted hallucination data, and categorizes hallucinations hierarchically. They also propose the MediHall Score, a new evaluation metric that assesses LVLMs' hallucinations through a hierarchical scoring system that accounts for the severity and type of each hallucination. Additionally, they present MediHallDetector, a novel LVLM trained with a multitask objective for precise hallucination detection. Extensive experiments with popular LVLMs on the Med-HallMark benchmark establish baselines and demonstrate the effectiveness of the proposed methods. The findings show that the MediHall Score provides a more nuanced understanding of hallucination impacts than traditional metrics, and that MediHallDetector outperforms existing models in hallucination detection. The authors hope that these contributions will significantly improve the reliability of LVLMs in medical applications.
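To make the idea of a hierarchical, severity-aware score concrete, here is a minimal sketch of how such a metric could be computed. The category names, weights, and aggregation below are illustrative assumptions for exposition only; they are not the authors' published definition of the MediHall Score.

```python
from enum import Enum
from statistics import mean

# Hypothetical hallucination hierarchy: the paper categorizes hallucinations
# by type and severity, but these specific labels and weights are assumed
# here for illustration, not taken from the paper.
class HallucinationType(Enum):
    CATASTROPHIC = 0.0   # e.g., a fabricated finding that contradicts the image
    CRITICAL = 0.25      # e.g., a wrong attribute of a real finding
    MINOR = 0.75         # e.g., a peripheral or stylistic inaccuracy
    NONE = 1.0           # a faithful statement

def hierarchical_hallucination_score(sentence_labels: list[HallucinationType]) -> float:
    """Aggregate per-sentence hallucination labels into one score in [0, 1];
    higher means fewer and less severe hallucinations."""
    if not sentence_labels:
        return 1.0
    return mean(label.value for label in sentence_labels)

# Example: a generated report with one critical and one minor hallucination.
labels = [HallucinationType.NONE, HallucinationType.CRITICAL, HallucinationType.MINOR]
print(f"Hierarchical score: {hierarchical_hallucination_score(labels):.2f}")
```

The key design point such a metric captures is that a single severe hallucination (e.g., a fabricated diagnosis) penalizes the score far more than several minor ones, which flat error-rate metrics cannot distinguish.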