Visual Hallucinations of Multi-modal Large Language Models

16 Jun 2024 | Wen Huang*, Hongbin Liu*, Minxin Guo, Neil Zhenqiang Gong
This paper introduces VHTest, a method for generating diverse visual hallucination (VH) instances to evaluate multi-modal large language models (MLLMs). The authors propose a three-step pipeline: first, identify initial VH instances in existing image datasets; second, generate a text description for each VH mode based on those instances; and third, use text-to-image generative models to synthesize new VH images from the descriptions.

Using this pipeline, they collect a benchmark of 1,200 VH instances spanning 8 VH modes: existence, shape, color, orientation, OCR, size, position, and counting. Evaluating state-of-the-art MLLMs, including GPT-4V, LLaVA-1.5, and MiniGPT-v2, on this benchmark, they find that the models hallucinate on a large fraction of the instances.

The authors also show that VHTest is effective at generating successful VH instances, and that fine-tuning an MLLM on the benchmark dataset reduces its likelihood of hallucinating without sacrificing performance on other benchmarks, while improving its performance on visual question answering tasks. The paper highlights the importance of evaluating MLLMs for visual hallucinations and provides a new benchmark for this purpose.
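To make the three-step pipeline above concrete, here is a minimal Python sketch. All function names and details (find_initial_instances, describe_mode, synthesize_image) are illustrative placeholders, not the paper's actual code or API; each stub marks where the corresponding real step would run.

```python
from dataclasses import dataclass

@dataclass
class VHInstance:
    image: str             # path to the generated image
    question: str          # question probing the targeted visual property
    reference_answer: str  # ground-truth answer used to detect hallucination

def find_initial_instances(dataset, vh_mode):
    # Step 1 (stub): mine an existing image dataset for seed images that
    # already trigger the given VH mode (e.g., miscounted objects).
    return list(dataset)

def describe_mode(seed, vh_mode):
    # Step 2 (stub): produce a text description of a scene that stresses the
    # property targeted by vh_mode (color, size, counting, ...).
    return f"A scene derived from {seed} stressing the '{vh_mode}' property."

def synthesize_image(description):
    # Step 3 (stub): a real implementation would call a text-to-image model;
    # here a placeholder path stands in for the generated image.
    return f"generated/{abs(hash(description))}.png"

def build_benchmark(dataset, vh_mode, n):
    # Chain the three steps into (image, question, answer) test instances.
    instances = []
    for seed in find_initial_instances(dataset, vh_mode)[:n]:
        desc = describe_mode(seed, vh_mode)
        image = synthesize_image(desc)
        question = f"Describe the '{vh_mode}' property of the objects shown."
        instances.append(VHInstance(image, question, desc))
    return instances

if __name__ == "__main__":
    bench = build_benchmark(["coco/0001.jpg", "coco/0002.jpg"], "counting", 2)
    for inst in bench:
        print(inst)
```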
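The evaluation described above can be sketched the same way: query the model under test with each (image, question) pair and count answers that disagree with the reference. query_mllm is a placeholder for a real inference call (e.g., to GPT-4V or LLaVA-1.5), and the substring check is deliberately simplistic; the paper's answer judging is presumably more careful.

```python
def query_mllm(image_path, question):
    # Placeholder: a real harness would send the image and question to the
    # model under test and return its textual answer.
    return "there are three cups"

def hallucination_rate(instances):
    """instances: list of (image_path, question, reference_answer) tuples."""
    wrong = 0
    for image_path, question, reference in instances:
        answer = query_mllm(image_path, question)
        # Count the instance as a hallucination if the reference answer does
        # not appear in the model's response (crude but self-contained).
        if reference.lower() not in answer.lower():
            wrong += 1
    return wrong / max(len(instances), 1)

# Toy example: one instance whose reference answer is "two cups".
print(hallucination_rate([("img.png", "How many cups are there?", "two cups")]))
```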