ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

18 Apr 2024 | Siming Yan*†1, Min Bai*2, Weifeng Chen2, Xiong Zhou2, Qixing Huang1, Li Erran Li2
The paper "ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling" addresses the issue of inaccurate visual grounding in large vision language models (LVLMs). LVLMs, which combine natural language understanding and image perception, often generate text that contains hallucinations, missing details, and incorrect attributions. To improve this, the authors introduce ViGoR, a framework that uses fine-grained reward modeling to enhance visual grounding. This approach leverages human evaluations and automated methods to train a reward model, which then fine-tunes the LVLM. The effectiveness of ViGoR is demonstrated through various benchmarks and evaluation methods, showing significant improvements over pre-trained baselines. The authors also plan to release a dataset of 16,000 image-text pairs with fine-grained evaluations to contribute to the community. The paper discusses related work in large vision language models, visual perception models, and reward modeling, and provides a detailed overview of the ViGoR framework, including its components and experimental results.The paper "ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling" addresses the issue of inaccurate visual grounding in large vision language models (LVLMs). LVLMs, which combine natural language understanding and image perception, often generate text that contains hallucinations, missing details, and incorrect attributions. To improve this, the authors introduce ViGoR, a framework that uses fine-grained reward modeling to enhance visual grounding. This approach leverages human evaluations and automated methods to train a reward model, which then fine-tunes the LVLM. The effectiveness of ViGoR is demonstrated through various benchmarks and evaluation methods, showing significant improvements over pre-trained baselines. The authors also plan to release a dataset of 16,000 image-text pairs with fine-grained evaluations to contribute to the community. The paper discusses related work in large vision language models, visual perception models, and reward modeling, and provides a detailed overview of the ViGoR framework, including its components and experimental results.