20 Mar 2024 | Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
Multi-Modal Hallucination Control by Visual Information Grounding
Generative Vision-Language Models (VLMs) are prone to generating plausible-sounding textual answers that are not always grounded in the input image. This phenomenon, referred to as "hallucination," stems from an excessive reliance on the language prior. As more tokens are generated, the reliance on the visual prompt decreases, leading to increased hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, favoring tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without further training. When training is possible, M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image. Empirical results show that M3ID and M3ID+DPO reduce hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve accuracy on VQA benchmarks like POPE by 21% and 24%.
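The core idea — favoring tokens with higher mutual information with the visual prompt — can be sketched as a contrastive scoring rule at each decoding step: compare the model's next-token distribution conditioned on the image against the distribution from the text-only language prior, and boost tokens whose probability rises when the image is present. The function below is a minimal illustrative sketch, not the paper's exact formulation; the name `m3id_scores`, the weight `gamma`, and the two-pass logits setup are assumptions for illustration.

```python
import numpy as np

def m3id_scores(logits_with_image, logits_text_only, gamma=1.0):
    """Score next tokens by a pointwise-mutual-information-style contrast.

    logits_with_image: model logits conditioned on (image, text prompt)
    logits_text_only:  model logits conditioned on the text prompt alone
    gamma:             strength of the visual-grounding amplification
                       (a hypothetical knob, not a value from the paper)
    """
    # Convert logits to log-probabilities via a log-softmax.
    log_p_cond = logits_with_image - np.logaddexp.reduce(logits_with_image)
    log_p_uncond = logits_text_only - np.logaddexp.reduce(logits_text_only)
    # log p(y | image, text) - log p(y | text) approximates the pointwise
    # mutual information between token y and the visual prompt; adding it
    # to the conditional score penalizes tokens driven by the language prior.
    return log_p_cond + gamma * (log_p_cond - log_p_uncond)

# Toy example: token 1 is favored by the language prior alone, but token 0
# gains the most probability once the image is conditioned on.
with_image = np.array([2.0, 2.1, 0.0])
text_only = np.array([0.0, 3.0, 0.0])
chosen = int(np.argmax(m3id_scores(with_image, text_only)))
```

In this toy setup, greedy decoding on `with_image` alone would pick token 1 (the prior-driven choice), while the contrastive score picks token 0, whose likelihood is most amplified by the image — the behavior the method aims for.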