VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

24 Jun 2024 | Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
This paper introduces VideoHallucer, a comprehensive benchmark for evaluating hallucinations in large video-language models (LVLMs). The benchmark categorizes hallucinations into two main types, intrinsic and extrinsic, with subcategories for detailed analysis: intrinsic hallucinations involve content that contradicts the original video, while extrinsic hallucinations involve content that cannot be verified from the source. The evaluation uses an adversarial binary VideoQA approach, in which pairs of basic and hallucinated questions are crafted to rigorously test models. The study evaluates eleven LVLMs on VideoHallucer, revealing that current models suffer from significant hallucination problems, that scaling datasets and parameters yields limited benefit for detecting extrinsic factual hallucinations, and that models are better at recognizing facts than at detecting hallucinations. The paper also introduces Self-PEP, a framework that strengthens models' resistance to hallucination through self-improvement and explanation mechanisms, achieving an average improvement of 5.38% across all model architectures.
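To make the adversarial binary VideoQA setup concrete, the sketch below shows one plausible way to score a model on basic/hallucinated question pairs. The data layout (a `QAPair` with a yes-answer basic question and a no-answer hallucinated question) and the rule that a pair counts only when both answers are correct are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: pair-wise scoring for adversarial binary VideoQA.
# The ask() interface and the "both must be correct" pair rule are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    video_path: str
    basic_question: str          # answerable from the video; expected answer "yes"
    hallucinated_question: str   # references fabricated content; expected answer "no"


def evaluate_pairs(pairs: List[QAPair],
                   ask: Callable[[str, str], str]) -> dict:
    """Score a yes/no model: ask(video_path, question) -> 'yes' or 'no'."""
    basic_correct = halluc_correct = pair_correct = 0
    for p in pairs:
        b_ok = ask(p.video_path, p.basic_question).strip().lower().startswith("yes")
        h_ok = ask(p.video_path, p.hallucinated_question).strip().lower().startswith("no")
        basic_correct += b_ok
        halluc_correct += h_ok
        pair_correct += (b_ok and h_ok)  # a pair counts only if both answers are right
    n = len(pairs)
    return {
        "basic_acc": basic_correct / n,          # fact recognition
        "hallucinated_acc": halluc_correct / n,  # hallucination detection
        "pair_acc": pair_correct / n,            # adversarial pair accuracy
    }
```

Under this kind of paired scoring, a model that answers "yes" to everything gets perfect basic accuracy but zero pair accuracy, which is why the adversarial pairing is a stricter test than either question type alone.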