20 Mar 2024 | Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
Generative Vision-Language Models (VLMs) often produce plausible but ungrounded textual answers, a phenomenon known as "hallucination." This paper traces the issue to an excessive reliance on the language prior. The authors introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method that amplifies the influence of the reference image over the language prior, thereby reducing hallucinations. M3ID can be applied to any pre-trained autoregressive VLM at inference time without additional training and with minimal computational overhead. Additionally, the authors propose combining M3ID with Direct Preference Optimization (DPO) to further improve the model's reliance on the prompt image without requiring labels.
Empirical results show that M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve accuracy on VQA benchmarks such as POPE by 21% and 24%. The paper also discusses related work, analyzes hallucinations in VLMs, and provides experimental details and ablation studies.
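
To make the idea of amplifying the image over the language prior concrete, the following is a minimal sketch of contrastive decoding between an image-conditioned and a text-only next-token distribution. It is not the paper's exact M3ID rule, which this summary does not spell out. The function names, the fixed `weight` parameter, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def image_amplified_logprobs(cond_logits, uncond_logits, weight):
    """Boost tokens that the image makes more likely than text alone.

    cond_logits:   next-token logits given image + text (assumed input)
    uncond_logits: next-token logits given text only (the language prior)
    weight:        hypothetical amplification strength; 0 recovers the
                   ordinary image-conditioned distribution
    """
    lc = log_softmax(cond_logits)
    lu = log_softmax(uncond_logits)
    # Add a multiple of the log-ratio between conditioned and unconditioned
    # probabilities, then renormalize.
    scores = [c + weight * (c - u) for c, u in zip(lc, lu)]
    return log_softmax(scores)


def greedy_pick(logprobs):
    """Index of the most likely token."""
    return max(range(len(logprobs)), key=logprobs.__getitem__)


# Toy 3-token vocabulary. The language prior strongly prefers token 1,
# while the image shifts probability toward token 0.
cond = [2.0, 2.1, 0.0]
uncond = [0.5, 2.5, 0.0]

plain = greedy_pick(log_softmax(cond))
amplified = greedy_pick(image_amplified_logprobs(cond, uncond, weight=1.0))
print(plain, amplified)
```

In this toy case, ordinary greedy decoding picks token 1, the one favored by the language prior, while the amplified scores pick token 0, the one whose probability rises most when the image is present. In a real VLM the two logit vectors would come from two forward passes, one with and one without the image, which is consistent with the summary's claim that no additional training is needed.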