Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance

This paper introduces MARINE, a training-free and API-free framework for reducing object hallucination in Large Vision-Language Models (LVLMs) during text generation. MARINE uses a pre-trained object grounding vision encoder to enrich the visual context of LVLMs and employs classifier-free guidance to incorporate the additional object grounding features, improving the precision of LVLMs' generations. In evaluations across six popular LVLMs with diverse metrics, including CHAIR, POPE, and GPT-4V evaluations, MARINE reduces hallucinations and improves the accuracy and detailedness of LVLMs' outputs. It outperforms existing fine-tuning-based methods and maintains the original answer's style while removing hallucinated content. MARINE is compatible with any vision model and projection function, and it requires neither additional training resources nor access to advanced LLMs.
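The guidance step mentioned in the summary can be illustrated with a minimal sketch. It assumes the common form of classifier-free guidance for text generation, in which the next-token log-probabilities from a grounded (conditional) pass and an ungrounded (unconditional) pass are linearly interpolated with a guidance strength `gamma`. The function names, the three-word vocabulary, the logit values, and `gamma=0.7` are all hypothetical choices for illustration and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_combine(cond_logits, uncond_logits, gamma):
    """Interpolate conditional and unconditional log-probabilities.

    gamma = 0 recovers the unconditional distribution and gamma = 1 the
    conditional one. This interpolation form is an assumption of the sketch.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalise to a valid distribution


# Hypothetical next-token scores over a toy vocabulary.
vocab = ["dog", "frisbee", "cat"]
uncond_logits = [2.0, 0.5, 1.8]  # without grounding: "cat" is a likely hallucination
cond_logits = [2.5, 1.5, 0.2]    # with object grounding features: "cat" is unsupported

guided = cfg_combine(cond_logits, uncond_logits, gamma=0.7)
probs = [math.exp(x) for x in guided]

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

In this toy setting the guided distribution assigns less probability to the ungrounded token "cat" than the unguided pass does, which is the qualitative effect the summary describes. A real system would apply the same combination to full-vocabulary logits at every decoding step.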

13 Feb 2024 | Linxi Zhao*, Yihe Deng*, Weitong Zhang§, Quanquan Gu†
