Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

28 May 2024 | Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim
This paper addresses the issue of hallucinations in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as "blind tokens," leads to inaccurate responses in tasks requiring fine-grained visual understanding. The authors propose Attentional Vision Calibration (AvisC), a technique that dynamically adjusts the logits for next-token prediction by contrasting the logits conditioned on the original visual tokens with those conditioned on the blind tokens alone. This reduces the model's reliance on blind tokens and promotes a more balanced consideration of all image tokens. AvisC is validated on benchmarks such as POPE, MME, and AMBER, where it consistently outperforms existing decoding techniques in mitigating object hallucinations.

The study highlights that LVLMs tend to concentrate attention on a few image tokens even when those tokens carry little object-discriminative information. This attention bias can lead to hallucinations, where the model generates responses that do not accurately reflect the visual input.
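The logit-level calibration described above can be pictured with a short sketch. This is a minimal illustration assuming a VCD-style contrastive weighting; the function name, the (1 + alpha) / alpha coefficients, and the toy tensors are our own assumptions, not details taken from the paper.

```python
import torch

def calibrate_logits(logits_full: torch.Tensor,
                     logits_blind: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Contrast next-token logits conditioned on all visual tokens against
    logits conditioned only on the blind tokens, down-weighting whatever the
    blind tokens alone would predict. The weighting scheme is an assumed,
    common contrastive-decoding recipe, not the paper's exact formulation."""
    return (1.0 + alpha) * logits_full - alpha * logits_blind

# Toy usage over a 5-token vocabulary.
logits_full = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0])   # conditioned on all image tokens
logits_blind = torch.tensor([2.5, 0.0, -1.0, 0.5, 0.2])  # conditioned on blind tokens only
probs = torch.softmax(calibrate_logits(logits_full, logits_blind), dim=-1)
print(probs)
```

Raising alpha pushes the output distribution further away from what the blind tokens alone support, which is the intended debiasing effect.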
The authors analyze attention patterns in LVLMs and find that tokens with lower attention weights often contain information essential for identifying nuanced object details. AvisC recalibrates the model's use of image tokens during decoding, without requiring additional training, external models, or costly self-feedback mechanisms. The method operates in three steps: (1) selecting the layers that allocate a higher proportion of attention to image tokens, (2) identifying the blind tokens that disproportionately monopolize attention within those layers, and (3) applying contrastive decoding to reduce the influence of blind tokens on the next-token distribution; the first two steps are sketched after this summary.

The results show that AvisC significantly mitigates hallucinations while improving the models' ability to capture and describe detailed image attributes. Compared against existing decoding methods such as VCD and M3ID, AvisC reduces hallucinations further and improves accuracy across multiple metrics. Beyond benchmark gains, this makes LVLMs more reliable and trustworthy in real-world applications requiring fine-grained visual understanding.
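The first two steps can be sketched as follows. This is a rough illustration under stated assumptions: the layer-scoring rule (attention mass on the image-token span at the last query position) and the blind-token threshold (attention above tau times the per-token mean) are illustrative stand-ins for the paper's exact criteria.

```python
import torch

def select_layers(attn_per_layer, image_slice, top_k=8):
    """Step 1 (illustrative): rank layers by how much attention mass the last
    query position places on image tokens, and keep the top-k layers.
    attn_per_layer: list of [num_heads, q_len, k_len] attention tensors."""
    scores = [layer[:, -1, image_slice].sum(dim=-1).mean().item()
              for layer in attn_per_layer]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]

def find_blind_tokens(attn_per_layer, layers, image_slice, tau=2.0):
    """Step 2 (illustrative): average head-wise attention over the selected
    layers and flag image tokens whose share exceeds tau times the mean."""
    avg = torch.stack([attn_per_layer[i][:, -1, image_slice].mean(dim=0)
                       for i in layers]).mean(dim=0)           # [num_image_tokens]
    return (avg > tau * avg.mean()).nonzero(as_tuple=True)[0]  # blind-token indices

# Toy usage: 4 layers, 2 heads, 1 query, 10 keys; image tokens occupy keys 0..5.
attn = [torch.rand(2, 1, 10) for _ in range(4)]
layers = select_layers(attn, slice(0, 6), top_k=2)
blind = find_blind_tokens(attn, layers, slice(0, 6))
```

Step 3 would then rerun the forward pass with only the flagged tokens visible (or the remaining image tokens masked) to obtain the blind-conditioned logits used in the calibration sketched earlier.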