4 Mar 2024 | David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
The paper introduces Contrastive Region Guidance (CRG), a training-free method for improving the visual grounding capabilities of vision-language models (VLMs). CRG applies classifier-free guidance (CFG) to help VLMs focus on specific regions of interest by contrasting the model's outputs with and without visual prompts. This contrast reduces biases in the model's responses, leading to more accurate answers. CRG is evaluated on a range of vision-language tasks, including visual prompt following, spatial reasoning, compositional generalization, and image-text alignment. The results show that CRG significantly improves performance, with an accuracy increase of up to 11.1% on ViP-Bench and substantial improvements on other benchmarks. The method is also effective at re-ranking region proposals from object detectors, and it enhances the model's ability to follow visual prompts without additional training or data. The paper also compares different region guidance strategies and analyzes the impact of the region guidance strength, validating the robustness of CRG's design choices.
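
The contrast between outputs with and without the visual prompt can be illustrated with a small sketch. This is not the authors' implementation: it assumes the usual CFG-style extrapolation of logits, (1 + alpha) * logits_with - alpha * logits_without, where `alpha` stands in for the region guidance strength. The function names, the toy logits, and the three-candidate setup are all hypothetical; the summary above does not specify the exact formula or how the two model passes are produced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_guidance(logits_with, logits_without, alpha=1.0):
    """CFG-style combination of two sets of answer logits (assumed form).

    logits_with:    scores from the pass that sees the visual prompt
    logits_without: scores from the pass that does not
    alpha:          guidance strength; alpha = 0 recovers the plain model
    """
    if len(logits_with) != len(logits_without):
        raise ValueError("both passes must score the same candidates")
    combined = [
        (1 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with, logits_without)
    ]
    return softmax(combined)


# Toy scores for three candidate answers (made-up numbers).
# Candidate 0 scores high in both passes, so its score reflects a prior
# that does not depend on the prompted region. Candidate 1 only scores
# well when the visual prompt is present.
logits_with = [2.0, 1.8, 0.1]
logits_without = [2.0, 0.2, 0.1]

plain = softmax(logits_with)
guided = contrastive_guidance(logits_with, logits_without, alpha=1.0)

print("plain choice: ", plain.index(max(plain)))
print("guided choice:", guided.index(max(guided)))
```

In this toy setup the unguided scores favor the candidate that is likely regardless of the visual prompt, while the guided scores favor the candidate whose evidence comes from it. Setting `alpha` to 0 returns the unguided distribution, which makes the effect of the guidance strength easy to probe.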