29 Mar 2024 | Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati
This paper addresses the difficulty vision-language models (VLMs) have in understanding negation in text, which is crucial for accurate image-text matching and text-to-image generation. Existing VLMs, such as CLIP, often fail to correctly interpret negation words like "not," "no," and "without," leading to incorrect associations between images and text. To evaluate and improve VLMs' ability to understand negations, the authors introduce CC-Neg, a large-scale dataset containing 228,246 image-caption pairs along with their corresponding negated captions. This dataset is used to train a new framework, CoN-CLIP, which improves negation understanding through a modified contrastive learning approach.
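To make the failure mode concrete, the following minimal sketch probes negation understanding with an off-the-shelf CLIP checkpoint: it scores one image against an affirmative caption and its negated counterpart. The image path and captions are placeholders, and the specific checkpoint is only an illustrative choice; a negation-blind model will often rank both captions similarly.

```python
# Minimal probe of negation understanding with an off-the-shelf CLIP model.
# Assumes transformers and Pillow are installed; "photo.jpg" is a placeholder
# image (e.g., a dog sitting on a couch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image
captions = [
    "a dog sitting on a couch",           # affirmative caption
    "a couch with no dog sitting on it",  # negated caption
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each caption.
# A negation-blind model tends to assign both captions comparable scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs.squeeze().tolist())))
```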
The proposed CoN-CLIP framework improves the ability of VLMs to understand negations by incorporating negated captions and relevant distractor images into the training process. This approach leads to significant improvements in zero-shot image classification accuracy across multiple datasets, with a 3.85% average gain in top-1 accuracy. Additionally, CoN-CLIP outperforms CLIP on compositionality benchmarks such as SugarCREPE by 4.4%, demonstrating enhanced compositional understanding of objects, relations, and attributes in text.
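The exact CoN-CLIP objective is not spelled out in this summary, but a hedged sketch of the general idea can illustrate how negated captions and distractor images might enter a CLIP-style contrastive loss: the true caption has to beat both the negated caption and a distractor image. All tensor names, shapes, and the equal weighting below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a contrastive objective that
# treats negated captions and distractor images as extra hard negatives.
# Assumes all embeddings are already L2-normalized, with shapes:
#   img        : (B, D) image embeddings
#   cap        : (B, D) matching caption embeddings
#   neg_cap    : (B, D) embeddings of the negated captions
#   distractor : (B, D) embeddings of distractor images
import torch
import torch.nn.functional as F

def con_clip_style_loss(img, cap, neg_cap, distractor, temperature=0.07):
    # Standard CLIP term: each image should match its own caption
    # against every other caption in the batch (and vice versa).
    logits = img @ cap.t() / temperature                    # (B, B)
    targets = torch.arange(img.size(0), device=img.device)
    clip_loss = (F.cross_entropy(logits, targets) +
                 F.cross_entropy(logits.t(), targets)) / 2

    # Hard-negative term: the true caption must also score higher than
    # its negated version and than a distractor image for the same caption.
    pos = (img * cap).sum(dim=-1) / temperature              # (B,)
    neg_text = (img * neg_cap).sum(dim=-1) / temperature     # (B,)
    neg_img = (distractor * cap).sum(dim=-1) / temperature   # (B,)
    hard_logits = torch.stack([pos, neg_text, neg_img], dim=1)   # (B, 3)
    hard_targets = torch.zeros(img.size(0), dtype=torch.long, device=img.device)
    hard_neg_loss = F.cross_entropy(hard_logits, hard_targets)

    return clip_loss + hard_neg_loss  # equal weighting is an assumption
```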
The study highlights the importance of negations in logic and natural language, as well as the challenges in processing negative sentences. The authors argue that understanding negations is essential for commonsense reasoning tasks and that the proposed framework addresses a crucial limitation of VLMs by strengthening the semantic associations between images and text. The results show that CoN-CLIP not only improves negation understanding but also enhances overall compositional understanding, making it a more effective foundation model for vision-language tasks. The code for this work is available at the provided GitHub link.