Multi-Object Hallucination in Vision-Language Models


8 Jul 2024 | Xuweiyi Chen, Ziqi Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
This paper investigates multi-object hallucination in large vision-language models (LVLMs): when asked to attend to multiple objects at once, models generate objects that are not present in the input image. The authors introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that accounts for the distribution of object classes within each image and uses visual prompts to remove referential ambiguity. ROPE divides the task into four subsets, In-the-Wild, Homogeneous, Heterogeneous, and Adversarial, and evaluates LVLMs on how accurately they recognize multiple queried objects.

The study finds that LVLMs hallucinate more when probed about multiple objects than when focusing on a single one, and that hallucination behaviors are influenced by the object class distribution within the image, data-specific factors, and intrinsic model behaviors. In particular, LVLMs tend to hallucinate objects that are less salient in the image or less frequent in the training data, and they may rely on shortcuts and spurious correlations. The authors suggest that future work focus on improving object recognition in multi-object scenarios, training on more balanced object distributions, and incorporating more diverse annotations. The study underscores the importance of addressing multi-object hallucination so that LVLMs perform reliably in real-world applications.
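To make the probing setup concrete, the following is a minimal illustrative sketch (in Python) of how a ROPE-style multi-object query and its scoring might look. The ProbedObject structure, the query_lvlm wrapper, the exact prompt wording, and the use of bounding boxes as visual prompts are assumptions introduced for illustration, not the paper's released implementation.

# Minimal sketch of a ROPE-style multi-object probe (illustrative only).
# `query_lvlm` is a hypothetical wrapper around whichever LVLM is evaluated;
# it is assumed to accept an image, a list of visual prompts, and a text prompt.
from dataclasses import dataclass

@dataclass
class ProbedObject:
    box: tuple      # (x1, y1, x2, y2) region used as the visual prompt
    gt_class: str   # ground-truth object class for that region

def build_prompt(num_objects: int) -> str:
    # Ask about all marked objects in a single query, mirroring the
    # multi-object setting in which hallucination is reported to increase.
    slots = ", ".join(f"object {i + 1}" for i in range(num_objects))
    return (f"The image contains {num_objects} marked regions. "
            f"Name the class of each: {slots}. "
            "Answer with a comma-separated list of class names.")

def evaluate_image(image, objects, query_lvlm):
    prompt = build_prompt(len(objects))
    reply = query_lvlm(image, [o.box for o in objects], prompt)
    preds = [p.strip().lower() for p in reply.split(",")]
    # Pad or truncate so every probed object gets exactly one prediction.
    preds = (preds + ["<none>"] * len(objects))[: len(objects)]
    correct = sum(p == o.gt_class.lower() for p, o in zip(preds, objects))
    return correct / len(objects)   # per-image multi-object recognition accuracy

Under this sketch, a prediction that does not match any ground-truth class for its region would count as a hallucinated object, and accuracy can be aggregated separately over the In-the-Wild, Homogeneous, Heterogeneous, and Adversarial subsets.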