October 28-November 1, 2024 | Jinfeng Wei, Xiaofeng Zhang
DOPRA is a novel approach designed to mitigate hallucinations in multimodal large language models (MLLMs). Unlike existing solutions that rely on costly training data or external knowledge, DOPRA intervenes purely at decoding time, applying weighted penalties and redistribution at specific layers, which makes it an economical and effective option that requires no additional resources. The approach is grounded in an insight into how hallucinations form in MLLMs: during generation, these models tend to over-rely on a small subset of "summary" tokens in the self-attention matrix while neglecting critical image-related information. To counteract this over-reliance, DOPRA applies weighted over-accumulation penalties and redistributes attention in specific layers, such as the 12th layer, during decoding.

Concretely, the method introduces two core strategies: over-accumulation penalization at specific attention layers during decoding, and retrospective reallocation. For the first, DOPRA integrates the over-accumulation penalty into beam search by adding weighted penalty scores to candidate selections, effectively demoting candidates whose attention exhibits strong patterns of over-trust in summary tokens.
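To make the penalty concrete, here is a minimal sketch of how such a re-ranking step might look, assuming access to each beam's self-attention map at the penalized layer (exposed, for instance, via `output_attentions=True` in Hugging Face Transformers). The function name, the `alpha` and `top_k` hyperparameters, and the top-k concentration measure are illustrative assumptions, not DOPRA's published implementation:

```python
import torch

def penalized_beam_scores(beam_scores, beam_attns, alpha=0.1, top_k=5):
    """Re-rank beam hypotheses with an over-accumulation penalty.

    A hedged sketch of the idea described above, not DOPRA's code:
    beams whose self-attention at a chosen intermediate layer (e.g.,
    the 12th) piles up on a handful of "summary" tokens are scored
    down, so the search prefers image-grounded hypotheses.

    beam_scores: (num_beams,) cumulative log-probabilities.
    beam_attns:  list of (num_heads, seq_len, seq_len) attention maps,
                 one per beam, taken from the penalized layer.
    """
    penalties = []
    for attn in beam_attns:
        # Total attention each past token receives, averaged over heads.
        received = attn.mean(dim=0).sum(dim=0)  # (seq_len,)
        # Fraction of attention mass held by the most-attended tokens:
        # a high value signals over-trust in a few summary tokens.
        k = min(top_k, received.numel())
        concentration = received.topk(k).values.sum() / received.sum()
        penalties.append(concentration)
    # Subtract the weighted penalty so over-accumulating beams rank lower.
    return beam_scores - alpha * torch.stack(penalties)
```

In a real decoder this re-ranking would run at every beam-expansion step, which is why the intervention adds essentially no cost beyond reading attention weights the model already computes.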
For the second, DOPRA applies a retrospective reallocation that re-examines the sequence of generated tokens and reallocates token selection to better align with the actual image content, disrupting the over-accumulation dynamics before they propagate and thereby reducing the incidence of hallucinatory descriptions in auto-generated captions (a minimal sketch follows at the end of this summary).

In sum, DOPRA's contributions are threefold: it offers a solution to hallucination in MLLMs that operates entirely at inference time, without external data, knowledge repositories, or additional training procedures; it identifies the critical role that summary tokens play in the formation of hallucinations; and it develops a penalty-based decoding technique augmented with a backtracking reallocation strategy. Comprehensive evaluations demonstrate DOPRA's superior performance, showing it to be a practically cost-free intervention that effectively mitigates hallucinations in multimodal language models and thereby enhances the credibility and reliability of these powerful AI tools in real-world applications.
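To complement the penalty sketch above, the following is a hedged illustration of the backtracking reallocation step. The data structures (per-step candidate rankings and a per-step over-accumulation score) and the `threshold` trigger are assumptions for exposition; the paper's exact backtracking criterion may differ:

```python
def retrospective_reallocation(tokens, ranked_candidates, accumulation,
                               threshold=0.6):
    """Re-examine a generated sequence and repair the first over-trusted step.

    tokens:            chosen token ids, one per decoding step.
    ranked_candidates: per step, candidate ids sorted by penalized score
                       (e.g., from a scorer like penalized_beam_scores).
    accumulation:      per-step over-accumulation measure in [0, 1].
    threshold:         backtrack trigger (assumed hyperparameter).
    """
    for t, score in enumerate(accumulation):
        if score > threshold:
            # Truncate at the step where over-trust set in and swap in the
            # best candidate under the penalized ranking; decoding then
            # resumes from the repaired prefix.
            return tokens[:t] + [ranked_candidates[t][0]]
    return tokens

# Illustrative call: step 2 exceeds the threshold, so the suffix after it
# is discarded and the penalized top candidate (13) replaces token 7.
repaired = retrospective_reallocation(
    tokens=[11, 42, 7, 99],
    ranked_candidates=[[11, 5], [42, 8], [13, 7], [99, 2]],
    accumulation=[0.2, 0.3, 0.7, 0.4],
)  # -> [11, 42, 13]
```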