28 Feb 2024 | Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, Jun Liu
This paper proposes a novel decoding method called Image-Biased Decoding (IBD) to alleviate hallucinations in Large Vision-Language Models (LVLMs). Hallucinations in LVLMs are primarily caused by over-reliance on linguistic priors, which leads to incorrect or irrelevant responses. IBD addresses this by contrasting the predictions of a conventional LVLM with those of an image-biased LVLM that emphasizes image information, computing a next-token probability distribution that amplifies image-grounded content while suppressing hallucinatory predictions driven by text dependence. A comprehensive statistical analysis validates the method's reliability, and an adaptive adjustment strategy keeps performance robust under varying conditions.

Contrasting toward the image-biased model is most effective for content words; for function words, where text priors are more relevant, it can be counterproductive. A dynamic adjustment mechanism is therefore introduced to balance image and text information adaptively at each decoding step. The method is further strengthened by fine-tuning the image-biased model and by incorporating an adaptive plausibility constraint to improve prediction accuracy.

Experimental results across multiple evaluation metrics show that IBD significantly reduces hallucinations without requiring additional training data or a large increase in model parameters; it operates with minimal overhead, offers comprehensive processing capability, and outperforms existing methods, particularly on tasks requiring detailed and accurate image-based reasoning. The study also highlights the existence of image-biased hallucinations, where over-reliance on visual cues leads to incorrect inferences; these are less common under mainstream evaluation frameworks but can still occur in specific scenarios. Overall, IBD provides a robust solution for mitigating hallucinations in LVLMs, with potential applications across vision-language tasks.
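The summary does not give the paper's exact decoding formula, so the following is a minimal sketch of how such a contrastive, image-biased step could look, assuming a formulation in the spirit of standard contrastive decoding. The function name `ibd_step` and the hyperparameters `alpha` (contrast strength) and `beta` (plausibility threshold) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a contrastive, image-biased decoding step (illustrative,
# not the paper's exact formulation). logits_std / logits_img would come from
# two forward passes of the same LVLM: a conventional pass and an image-biased
# pass that emphasizes visual information.
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def ibd_step(logits_std, logits_img, alpha=1.0, beta=0.1):
    """Return a next-token distribution that favors image-grounded tokens.

    alpha: contrast strength (assumed hyperparameter).
    beta:  adaptive-plausibility threshold (assumed hyperparameter).
    """
    p_std = softmax(np.asarray(logits_std, dtype=float))
    p_img = softmax(np.asarray(logits_img, dtype=float))

    # Adaptive plausibility constraint, as commonly used in contrastive
    # decoding: only tokens the conventional model already finds reasonably
    # likely may be promoted by the contrastive term.
    plausible = p_std >= beta * p_std.max()

    # Amplify what the image-biased pass supports and damp predictions that
    # rest mainly on linguistic priors.
    scores = (1 + alpha) * np.log(p_img + 1e-12) - alpha * np.log(p_std + 1e-12)
    scores[~plausible] = -np.inf

    return softmax(scores)

# Toy usage with random logits standing in for real model outputs.
rng = np.random.default_rng(0)
p_next = ibd_step(rng.normal(size=8), rng.normal(size=8))
next_token_id = int(np.argmax(p_next))
```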
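The dynamic adjustment mechanism is described only at a high level; one plausible (assumed) way to realize a step-wise balance is to weight the two predictions by how strongly they disagree, since agreement is typical of prior-driven function words and disagreement of image-grounded content words. The divergence rule, the `scale` parameter, and the function names below are illustrative assumptions, not the paper's mechanism.

```python
# Hypothetical step-wise weighting for the dynamic adjustment idea: trust the
# conventional model when both passes agree (typical for function words driven
# by text priors) and the image-biased model when they diverge (typical for
# image-grounded content words). The divergence-based rule is an assumption.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    def _kl(a, b):
        return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def dynamic_mix(p_std, p_img, scale=5.0):
    """Blend the two distributions with a weight that grows with disagreement."""
    w = 1.0 - np.exp(-scale * js_divergence(p_std, p_img))  # w in [0, 1)
    p = (1.0 - w) * p_std + w * p_img
    return p / p.sum(), w

# Toy usage: a flat distribution and a peaked one disagree, so w is near 1
# and the mixture leans on the image-biased prediction.
p_std = np.full(4, 0.25)
p_img = np.array([0.7, 0.1, 0.1, 0.1])
p_mixed, weight = dynamic_mix(p_std, p_img)
```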