Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training


2024-03-04 | David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Contrastive Region Guidance (CRG) is a training-free method that enhances the ability of vision-language models (VLMs) to follow visual prompts without requiring additional training. CRG works by contrasting model outputs generated with and without visual prompts, effectively removing biases that the model may have when answering without the necessary visual information. This approach significantly improves performance on various vision-language tasks, including spatial reasoning, compositional generalization, and image-text alignment. CRG achieves up to 11.1% improvement in accuracy on ViP-Bench, a benchmark with six diverse region-based tasks. It also improves performance on the hardest setting of the What'sUp benchmark by up to 10%, and on SugarCrepe by 11.5% and 7.5% on two challenging splits. CRG is effective in evaluating generated images, improving AUROC by 8.4 points and F1 by 6.8 points on SeeTRUE. Additionally, CRG helps re-rank bounding box proposals in referring expression comprehension and phrase grounding tasks, achieving up to 3.2% accuracy improvement. CRG's effectiveness is demonstrated through extensive experiments across multiple datasets and models, showing its versatility and robustness. The method is based on classifier-free guidance and is compatible with a variety of existing models, making it a valuable tool for improving visual grounding in VLMs.
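To make the contrastive step concrete, below is a minimal sketch of the core idea, assuming a PyTorch VLM exposed as a callable that maps (token ids, image) to next-token logits. The names `vlm`, `blackout_regions`, `crg_logits`, and the guidance strength `alpha` are illustrative assumptions, not the authors' released implementation.

```python
import torch

def blackout_regions(image, boxes):
    """Return a copy of `image` (C, H, W) with each bounding box
    (x1, y1, x2, y2, in pixels) blacked out."""
    masked = image.clone()
    for x1, y1, x2, y2 in boxes:
        masked[:, y1:y2, x1:x2] = 0.0
    return masked

def crg_logits(vlm, text_ids, image, boxes, alpha=1.0):
    """One decoding step of Contrastive Region Guidance (sketch).

    `vlm` is assumed to be a callable returning next-token logits for
    (text_ids, image); this interface is hypothetical.
    """
    image_masked = blackout_regions(image, boxes)
    with torch.no_grad():
        logits_with = vlm(text_ids, image)            # conditioned on the full image
        logits_without = vlm(text_ids, image_masked)  # prompted regions removed
    # Classifier-free-guidance-style contrast: amplify the probability
    # shift caused by the highlighted regions, cancelling the bias the
    # model would apply even without seeing them.
    return (1 + alpha) * logits_with - alpha * logits_without
```

The same contrastive score also suggests how re-ranking works for referring expression comprehension: score the expression once per candidate box (masked vs. unmasked) and keep the proposal whose region most increases the expression's likelihood.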