April 2025 | Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou
This survey provides a comprehensive analysis of hallucination in multimodal large language models (MLLMs), also known as large vision-language models (LVLMs). While MLLMs have made significant progress on multimodal tasks, they often generate outputs inconsistent with the visual content, a failure mode known as hallucination. This issue obstructs practical deployment and raises concerns about reliability. The survey reviews recent advances in identifying, evaluating, and mitigating hallucinations, offering a detailed overview of underlying causes, evaluation benchmarks, metrics, and mitigation strategies, and it analyzes current challenges and limitations, formulating open questions for future research.

By classifying hallucination causes, evaluation benchmarks, and mitigation methods, the survey aims to deepen understanding of hallucinations in MLLMs and inspire further advances. It traces the causes of hallucination to four factors: data, model, training, and inference. It then presents metrics and benchmarks for evaluating hallucinations, including CHAIR, POPE, MME, CIEM, MMHal-Bench, GAVIE, NOPE, HaELM, FaithScore, Bingo, AMBER, RAH-Bench, HallusionBench, CCEval, MERLIM, FGHE, OpenCHAIR, Hal-Eval, CorrelationQA, VQAv2-IDK, MHaluBench, and VHTest. These benchmarks assess different aspects of hallucination, including object, attribute, relation, and event hallucination, as well as robustness to spurious visual inputs.

The survey contributes to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners.
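To make the metric list above concrete, here is a minimal sketch of a CHAIR-style object-hallucination computation: the instance-level score (CHAIR_i) is the fraction of mentioned objects that are not actually present in the image, and the sentence-level score (CHAIR_s) is the fraction of captions containing at least one such object. The sketch assumes object mentions have already been extracted from each caption and normalized against the image's ground-truth object labels (in the original CHAIR setup, via synonym matching against COCO categories); the function name and data layout below are illustrative, not taken from the survey.

```python
def chair_scores(captions_objects, ground_truth_objects):
    """Compute CHAIR_i (instance-level) and CHAIR_s (sentence-level).

    captions_objects: list of sets, objects mentioned in each generated caption
    ground_truth_objects: list of sets, objects actually present in each image
    (Both are assumed to use the same normalized object vocabulary.)
    """
    total_mentioned = 0
    total_hallucinated = 0
    captions_with_hallucination = 0

    for mentioned, present in zip(captions_objects, ground_truth_objects):
        hallucinated = mentioned - present  # mentioned but absent from the image
        total_mentioned += len(mentioned)
        total_hallucinated += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    chair_i = total_hallucinated / max(total_mentioned, 1)
    chair_s = captions_with_hallucination / max(len(captions_objects), 1)
    return chair_i, chair_s


# Toy example: the second caption hallucinates a "dog"
mentioned = [{"person", "surfboard"}, {"cat", "dog", "sofa"}]
present = [{"person", "surfboard", "wave"}, {"cat", "sofa"}]
print(chair_scores(mentioned, present))  # -> (0.2, 0.5)
```

Other benchmarks in the list take a different route: POPE, for instance, replaces free-form caption scoring with polling-style yes/no questions ("Is there a <object> in the image?") and reports standard classification metrics such as accuracy and F1.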