Learn “No” to Say “Yes” Better: Improving Vision-Language Models via Negations

29 Mar 2024 | Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati
This paper addresses the limitations of vision-language models (VLMs) in understanding negations, which are crucial for logical and natural language reasoning. Existing VLMs, such as CLIP, often fail to accurately interpret negated captions, leading to incorrect associations between images and text. To improve this, the authors introduce CC-Neg, a dataset containing 228,246 image-caption pairs with true and negated captions. They propose CoN-CLIP, a framework that enhances the contrastive loss of CLIP by incorporating negated captions and distractor images. This approach improves the model's ability to understand negations, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Additionally, CoN-CLIP outperforms CLIP on compositionality benchmarks like SugarCREPE by 4.4%, demonstrating enhanced compositional understanding of objects, relations, and attributes. The work contributes a dataset and a framework that strengthen the semantic associations between images and text, promoting efficiency and accessibility in large-scale foundation models.
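The exact training objective is not spelled out in this summary, so the sketch below is only a minimal illustration of how a CLIP-style contrastive loss can be extended with per-sample negated captions and distractor images treated as additional hard negatives. The function name, tensor shapes, weighting, and temperature are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def negation_aware_contrastive_loss(img_emb, cap_emb, neg_cap_emb, distractor_emb, temperature=0.07):
    # Illustrative CoN-CLIP-style objective (assumed form, not the paper's exact loss):
    # pull each image toward its true caption, push it away from its own negated
    # caption, and push each true caption away from a paired distractor image.
    # All embeddings are assumed L2-normalized with shape (batch, dim).

    # Standard CLIP-style image-to-text logits over the batch of true captions.
    logits_i2t = img_emb @ cap_emb.t() / temperature                              # (B, B)
    # Similarity of each image to its own negated caption, appended as an extra
    # hard-negative column for that image.
    neg_col = (img_emb * neg_cap_emb).sum(dim=-1, keepdim=True) / temperature     # (B, 1)
    logits_i2t = torch.cat([logits_i2t, neg_col], dim=1)                          # (B, B+1)

    # Text-to-image logits, with the distractor image appended as a hard negative
    # for each true caption.
    logits_t2i = cap_emb @ img_emb.t() / temperature                              # (B, B)
    dis_col = (cap_emb * distractor_emb).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    logits_t2i = torch.cat([logits_t2i, dis_col], dim=1)                          # (B, B+1)

    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))

if __name__ == "__main__":
    # Quick check with random unit-norm embeddings standing in for encoder outputs.
    B, D = 8, 512
    unit = lambda x: F.normalize(x, dim=-1)
    loss = negation_aware_contrastive_loss(
        unit(torch.randn(B, D)), unit(torch.randn(B, D)),
        unit(torch.randn(B, D)), unit(torch.randn(B, D)))
    print(loss.item())

Under these assumptions, the negated caption and the distractor image simply widen each row of the similarity matrix, so the usual cross-entropy over matching pairs also penalizes assigning high similarity to the negated text or the wrong image.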