VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

24 Jun 2024 | Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
VideoHallucer is a comprehensive benchmark for evaluating hallucinations in large video-language models (LVLMs). It divides hallucinations into two types: intrinsic hallucinations, which are inconsistent with the source video, and extrinsic hallucinations, which cannot be verified from the source. Models are evaluated with an adversarial binary VideoQA method that pairs basic questions with hallucinated ones. An evaluation of eleven LVLMs on VideoHallucer shows that most models exhibit significant hallucination issues; increasing dataset and parameter size improves the detection of basic visual cues but has limited impact on detecting extrinsic factual hallucinations, and existing models are better at confirming facts than at identifying hallucinations. These findings motivate the self-PEP framework, which improves hallucination resistance by 5.38% across all model architectures.
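
The paired design supports a strict aggregate score in which a model gets credit for a pair only when it answers both the basic and the hallucinated question correctly. The snippet below is a minimal sketch of that paired scoring under an assumed, simplified data layout; field names such as `basic_pred` and `hallu_gt` are illustrative, not the benchmark's actual schema.

```python
from typing import Dict, List


def paired_accuracy(pairs: List[Dict[str, str]]) -> Dict[str, float]:
    """Score paired binary VideoQA: a pair counts only if both answers are right.

    Each item is assumed to look like:
        {"basic_pred": "yes", "basic_gt": "yes",
         "hallu_pred": "no",  "hallu_gt": "no"}
    where the basic question typically has ground truth "yes" and the
    hallucinated question "no". Field names are illustrative.
    """
    basic_correct = hallu_correct = pair_correct = 0
    for p in pairs:
        b_ok = p["basic_pred"].strip().lower() == p["basic_gt"].strip().lower()
        h_ok = p["hallu_pred"].strip().lower() == p["hallu_gt"].strip().lower()
        basic_correct += b_ok
        hallu_correct += h_ok
        pair_correct += b_ok and h_ok
    n = len(pairs)
    return {
        "basic_acc": basic_correct / n,         # accuracy on basic questions
        "hallucinated_acc": hallu_correct / n,  # accuracy on hallucinated questions
        "overall_acc": pair_correct / n,        # both answers in a pair must be correct
    }


if __name__ == "__main__":
    demo = [
        {"basic_pred": "yes", "basic_gt": "yes", "hallu_pred": "no", "hallu_gt": "no"},
        {"basic_pred": "yes", "basic_gt": "yes", "hallu_pred": "yes", "hallu_gt": "no"},
    ]
    print(paired_accuracy(demo))  # overall_acc penalizes the second pair
```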
The benchmark comprises five settings: object-relation, temporal, and semantic detail hallucinations (intrinsic), plus factual and non-factual hallucinations (extrinsic). Each setting contains 400 question-answer pairs, split into 200 basic and 200 hallucinated questions, built from 948 videos ranging from 7 to 187 seconds and drawn from a range of video understanding benchmarks. Evaluation is VQA-based rather than caption-based, which makes it less sensitive to external factors and to the complexity of scoring free-form captions. Bias metrics such as Yes Percentage Difference and False Positive Ratio quantify how strongly a model leans toward answering "yes"; models with a pronounced bias tend to have more hallucination issues. An evaluation of twelve LVLMs reveals that most models perform poorly in the factual hallucination setting. A comparison with image-language models shows that they detect object-relation hallucinations better than video-language models, and human evaluations show moderate agreement among evaluators. To mitigate these issues, the self-PEP framework incorporates explanatory processes into prediction: a self-improvement phase in which models autonomously extract visual knowledge, and a self-explanation phase in which they generate explanations to refine their predictions. Self-PEP significantly improves performance on VideoHallucer, particularly on extrinsic factual hallucination detection, and is effective at reducing hallucination issues, especially factual ones, in existing LVLMs.
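
As a rough illustration of the bias metrics, the sketch below uses commonly seen definitions: Yes Percentage Difference as the gap between the model's rate of "yes" answers and the ground-truth rate, and False Positive Ratio as the fraction of wrong answers where the model said "yes". The paper's exact formulations may differ.

```python
from typing import List


def yes_percentage_difference(preds: List[str], gts: List[str]) -> float:
    """Gap between the model's rate of 'yes' answers and the ground-truth rate.

    A value near 0 suggests no systematic bias; a large positive value suggests
    the model over-answers 'yes', a common source of hallucination.
    """
    pred_yes = sum(p.strip().lower() == "yes" for p in preds) / len(preds)
    gt_yes = sum(g.strip().lower() == "yes" for g in gts) / len(gts)
    return pred_yes - gt_yes


def false_positive_ratio(preds: List[str], gts: List[str]) -> float:
    """Among all incorrect answers, the fraction where the model answered 'yes'."""
    wrong = [(p, g) for p, g in zip(preds, gts)
             if p.strip().lower() != g.strip().lower()]
    if not wrong:
        return 0.0
    return sum(p.strip().lower() == "yes" for p, _ in wrong) / len(wrong)


if __name__ == "__main__":
    preds = ["yes", "yes", "no", "yes"]
    gts = ["yes", "no", "no", "no"]
    print(yes_percentage_difference(preds, gts))  # 0.75 - 0.25 = 0.5
    print(false_positive_ratio(preds, gts))       # 2 wrong answers, both 'yes' -> 1.0
```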
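
The self-PEP procedure can be read, at inference time, as a predict-explain-predict loop. The sketch below is an interpretation of that control flow, not the paper's implementation: `ask_model` is a generic stand-in for any LVLM interface, the prompts are illustrative, and the separate self-improvement (knowledge-extraction) phase is not modeled here.

```python
from typing import Callable

# Generic interface: takes a video reference and a text prompt, returns text.
AskModel = Callable[[str, str], str]


def self_pep_answer(ask_model: AskModel, video: str, question: str) -> str:
    """Predict-explain-predict loop sketching the self-PEP idea.

    1) Initial prediction: answer the binary question directly.
    2) Self-explanation: justify the answer using evidence grounded in the video.
    3) Refined prediction: answer again, conditioned on the explanation, so that
       unsupported (hallucinated) claims can be revised.
    """
    first = ask_model(video, f"{question} Answer yes or no.")
    explanation = ask_model(
        video,
        f"You answered '{first}' to: {question}\n"
        "Explain your answer using only evidence visible in the video.",
    )
    final = ask_model(
        video,
        f"Question: {question}\n"
        f"Candidate answer: {first}\n"
        f"Explanation: {explanation}\n"
        "If the explanation does not support the candidate answer, change it. "
        "Answer yes or no.",
    )
    return final


if __name__ == "__main__":
    # Trivial mock model, just to show the call pattern.
    def mock_model(video: str, prompt: str) -> str:
        return "no" if "Answer yes or no" in prompt else "I see no such object."

    print(self_pep_answer(mock_model, "demo.mp4", "Is there a red car in the video?"))
```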