Multi-Object Hallucination in Vision-Language Models


8 Jul 2024 | Xuweiyi Chen, Ziqi Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
This paper investigates multi-object hallucination in large vision-language models (LVLMs): when asked to attend to multiple objects at once, models generate objects that are not present in the input image. The authors introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that accounts for the distribution of object classes within each image and uses visual prompts to remove referential ambiguity. ROPE divides the task into four subsets, In-the-Wild, Homogeneous, Heterogeneous, and Adversarial, and evaluates LVLMs on how accurately they recognize multiple queried objects.

The study finds that LVLMs hallucinate more when probed about multiple objects than when focusing on a single one, and that hallucination behaviors are influenced by the object class distribution within the image, data-specific factors, and intrinsic model behaviors. In particular, LVLMs tend to hallucinate objects that are less salient in the image or less frequent in the training data, and they may rely on shortcuts and spurious correlations. The authors suggest that future work focus on improving object recognition in multi-object scenarios, training on more balanced object distributions, and incorporating more diverse annotations. The study underscores the importance of addressing multi-object hallucination so that LVLMs perform reliably in real-world applications.
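To make the probing setup concrete, the following is a minimal illustrative sketch (in Python) of how a ROPE-style multi-object query and its scoring might look. The ProbedObject structure, the query_lvlm wrapper, the exact prompt wording, and the use of bounding boxes as visual prompts are assumptions introduced for illustration, not the paper's released implementation.

# Minimal sketch of a ROPE-style multi-object probe (illustrative only).
# `query_lvlm` is a hypothetical wrapper around whichever LVLM is evaluated;
# it is assumed to accept an image, a list of visual prompts, and a text prompt.
from dataclasses import dataclass

@dataclass
class ProbedObject:
    box: tuple      # (x1, y1, x2, y2) region used as the visual prompt
    gt_class: str   # ground-truth object class for that region

def build_prompt(num_objects: int) -> str:
    # Ask about all marked objects in a single query, mirroring the
    # multi-object setting in which hallucination is reported to increase.
    slots = ", ".join(f"object {i + 1}" for i in range(num_objects))
    return (f"The image contains {num_objects} marked regions. "
            f"Name the class of each: {slots}. "
            "Answer with a comma-separated list of class names.")

def evaluate_image(image, objects, query_lvlm):
    prompt = build_prompt(len(objects))
    reply = query_lvlm(image, [o.box for o in objects], prompt)
    preds = [p.strip().lower() for p in reply.split(",")]
    # Pad or truncate so every probed object gets exactly one prediction.
    preds = (preds + ["<none>"] * len(objects))[: len(objects)]
    correct = sum(p == o.gt_class.lower() for p, o in zip(preds, objects))
    return correct / len(objects)   # per-image multi-object recognition accuracy

Under this sketch, a prediction that does not match any ground-truth class for its region would count as a hallucinated object, and accuracy can be aggregated separately over the In-the-Wild, Homogeneous, Heterogeneous, and Adversarial subsets.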