Visual Hallucinations of Multi-modal Large Language Models

16 Jun 2024 | Wen Huang, Hongbin Liu, Minxin Guo, Neil Zhenqiang Gong
The paper "Visual Hallucinations of Multi-modal Large Language Models" by Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong addresses the issue of visual hallucinations (VH) in multi-modal large language models (MLLMs). VH occurs when MLLMs generate incorrect details about an image in visual question answering tasks. The authors propose VHTest, a tool that generates diverse VH instances to evaluate MLLMs' performance. VHTest identifies initial VH instances from existing image datasets, generates text descriptions for each VH mode, and uses a text-to-image generative model to create new VH images. The benchmark dataset constructed using VHTest includes 1,200 VH instances across 8 VH modes: existence, shape, color, orientation, OCR, size, position, and counting. The authors find that state-of-the-art MLLMs like GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a significant fraction of the instances in the benchmark. Fine-tuning these models on the VH benchmark dataset reduces their likelihood to hallucinate without compromising performance on other benchmarks. The paper also discusses the limitations and future work, emphasizing the need for fully automatic VH instance generation.The paper "Visual Hallucinations of Multi-modal Large Language Models" by Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong addresses the issue of visual hallucinations (VH) in multi-modal large language models (MLLMs). VH occurs when MLLMs generate incorrect details about an image in visual question answering tasks. The authors propose VHTest, a tool that generates diverse VH instances to evaluate MLLMs' performance. VHTest identifies initial VH instances from existing image datasets, generates text descriptions for each VH mode, and uses a text-to-image generative model to create new VH images. The benchmark dataset constructed using VHTest includes 1,200 VH instances across 8 VH modes: existence, shape, color, orientation, OCR, size, position, and counting. The authors find that state-of-the-art MLLMs like GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate for a significant fraction of the instances in the benchmark. Fine-tuning these models on the VH benchmark dataset reduces their likelihood to hallucinate without compromising performance on other benchmarks. The paper also discusses the limitations and future work, emphasizing the need for fully automatic VH instance generation.