ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

18 Apr 2024 | Siming Yan, Min Bai, Weifeng Chen, Qixing Huang, Li Erran Li
ViGoR improves the visual grounding of large vision language models (LVLMs) through fine-grained reward modeling. The approach combines two reward signals: a reward model trained on human feedback to score generated text at the sentence level, and an automated component that uses strong visual perception models to verify whether objects mentioned in the text are actually present in the image. These sentence-level rewards are then used to fine-tune the LVLM. The process is data-efficient, requiring only about 16,000 image-text pairs with fine-grained human evaluations.

Evaluated on benchmarks such as POPE and MME, ViGoR shows significant gains over baseline models, reducing hallucinations as well as errors in reasoning, counting, and object relationships, while preserving the models' creative and reasoning capabilities. The framework is general and can be applied to a variety of LVLMs, making it an efficient, cost-effective, and broadly applicable contribution to vision-language modeling.
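To make the two reward signals concrete, below is a minimal Python sketch of how a learned sentence-level reward and an automated object-presence check might be combined into per-sentence rewards. This is not the paper's implementation: the function bodies, the CHECKABLE_OBJECTS vocabulary, and the simple additive combination are all hypothetical stand-ins for the trained reward model and the visual perception model used in ViGoR.

from typing import List, Set

# Hypothetical vocabulary of objects the automated checker knows how to verify.
CHECKABLE_OBJECTS: Set[str] = {"dog", "cat", "frisbee", "ball", "grass", "tree"}


def reward_model_score(sentence: str, image_id: str) -> float:
    """Stand-in for the reward model trained on ~16k fine-grained human
    evaluations; in practice this is a learned scorer, not a constant."""
    return 0.5


def detected_objects(image_id: str) -> Set[str]:
    """Stand-in for a visual perception model (e.g. an object detector)
    returning the objects actually present in the image."""
    return {"dog", "frisbee", "grass"}


def grounding_score(sentence: str, image_id: str) -> float:
    """Automated reward: +1 for each checkable object the sentence mentions
    and the detector confirms, -1 for each mention the detector rejects."""
    mentioned = {w.strip(".,").lower() for w in sentence.split()} & CHECKABLE_OBJECTS
    present = detected_objects(image_id)
    return float(len(mentioned & present) - len(mentioned - present))


def sentence_rewards(caption: str, image_id: str) -> List[float]:
    """Combine both signals per sentence; these rewards would then drive
    fine-tuning of the LVLM (e.g. via RL or reward-weighted updates)."""
    rewards = []
    for sentence in filter(None, (s.strip() for s in caption.split("."))):
        rewards.append(reward_model_score(sentence, image_id)
                       + grounding_score(sentence, image_id))
    return rewards


if __name__ == "__main__":
    caption = "A dog leaps to catch a frisbee. A cat watches from a tree."
    print(sentence_rewards(caption, "demo_image"))  # [2.5, -1.5]

In this toy run, the first sentence is rewarded because "dog" and "frisbee" are confirmed by the detector, while the second is penalized because "cat" and "tree" are mentioned but not detected, illustrating how sentence-level grounding feedback can penalize hallucinated objects.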