12 Jun 2024
**CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples**
**Authors:** Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee
**Institutions:** University of Wisconsin–Madison; Microsoft Research
**Abstract:**
CounterCurate is a framework for improving the visio-linguistic compositional reasoning of both contrastive and generative multimodal models. It targets two critical but under-explored problems: the neglect of physically grounded reasoning (counting and positional understanding) and the untapped potential of highly capable text and image generation models for semantic counterfactual fine-tuning. Using simple data augmentation and the grounded image generation model GLIGEN to produce fine-tuning data, CounterCurate yields significant gains on benchmarks such as Flickr30k-Positions. It further leverages the high-performing text generation model GPT-4V and the image generation model DALL-E 3 to curate challenging semantic counterfactuals, boosting compositional reasoning on benchmarks like SugarCrepe. Experiments across multiple datasets show significant improvements over state-of-the-art models.
**Contributions:**
- Systematically studies physically grounded compositional reasoning and highlights the near-random performance of multimodal models on curated benchmarks.
- Significantly improves physical reasoning capabilities by generating counterfactual images and captions using data augmentation and grounded image inpainting.
- Enhances semantic compositional reasoning by curating challenging image-text pairs using advanced text and image generation models.
**Methods:**
- **Flickr30k-Positions:** Probes positional understanding with negative captions and counterfactual images for left-right and above-below relations; left-right negatives come from simple word swaps paired with horizontal image flips (see the first sketch after this list).
- **Flickr30k-Counting:** Targets object counting with captions whose stated counts are perturbed, plus matching counterfactual images produced by grounded inpainting with GLIGEN (see the two counting sketches after this list).
- **Flickr30k-Attributes:** Strengthens semantic compositional reasoning with hard negative captions written by GPT-4V and matching counterfactual images rendered by DALL-E 3 (see the final sketch after this list).
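A minimal sketch of the left-right counterfactual idea in Python. The swap table and function names are illustrative assumptions; the paper describes the general strategy (swap positional words in the caption, horizontally flip the image), not this exact code.

```python
from PIL import Image

# Illustrative word-swap table; the paper's exact rules may differ.
SWAP = {"left": "right", "right": "left"}

def negate_position_caption(caption: str) -> str:
    """Swap 'left' and 'right' so the caption contradicts the image."""
    return " ".join(SWAP.get(w.lower(), w) for w in caption.split())

def negate_position_image(path: str) -> Image.Image:
    """Mirror the image horizontally: the original caption becomes false,
    while the swapped caption becomes true for the flipped image."""
    return Image.open(path).transpose(Image.Transpose.FLIP_LEFT_RIGHT)

print(negate_position_caption("a dog sits to the left of a chair"))
# -> "a dog sits to the right of a chair"
```

The flip gives a genuine counterfactual image for free: no generation model is needed for the left-right case.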
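The counting counterfactuals follow the same pattern on the caption side: perturb the stated count so the caption no longer matches the image. A minimal sketch, with an illustrative number-word table that would need broader coverage in practice:

```python
import re

# Illustrative number-word table; real captions need wider coverage.
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
WORD_NUMS = {v: k for k, v in NUM_WORDS.items()}

def negate_count_caption(caption: str) -> str:
    """Increment the first number word so the stated count is wrong."""
    def bump(match: re.Match) -> str:
        n = NUM_WORDS[match.group(0).lower()]
        return WORD_NUMS.get(n + 1, str(n + 1))
    return re.sub(r"\b(one|two|three|four|five)\b", bump, caption,
                  count=1, flags=re.IGNORECASE)

print(negate_count_caption("Two dogs play in the park"))
# -> "three dogs play in the park"
```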
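On the image side, the paper uses GLIGEN's grounded inpainting to add or remove object instances so that the perturbed count becomes true. A hedged sketch using the Hugging Face diffusers GLIGEN pipeline; the checkpoint name, box coordinates, and file names are assumptions for illustration, not the paper's exact configuration:

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline
from PIL import Image

# Inpainting-flavored GLIGEN checkpoint (assumed name from diffusers docs).
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-inpainting-text-box", torch_dtype=torch.float16
).to("cuda")

source = Image.open("two_dogs.jpg").convert("RGB")  # hypothetical input
images = pipe(
    prompt="three dogs play in the park",
    gligen_inpaint_image=source,
    gligen_phrases=["a dog"],                 # what to paint into the box
    gligen_boxes=[[0.55, 0.40, 0.90, 0.85]],  # normalized xyxy box (assumed)
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images
images[0].save("three_dogs.jpg")
```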
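For the semantic counterfactuals, a hedged sketch of the curation loop with the OpenAI Python client (v1). The prompt wording and the "gpt-4o" stand-in model name are assumptions; the paper itself uses GPT-4V to write minimally edited negative captions and DALL-E 3 to render images that match them:

```python
from openai import OpenAI

client = OpenAI()

def hard_negative_caption(caption: str) -> str:
    """Ask the model for a minimally edited caption that contradicts the image."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name; the paper uses GPT-4V
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this caption so it describes a plausible but "
                f"different scene, changing as few words as possible: {caption!r}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def counterfactual_image_url(negative_caption: str) -> str:
    """Render an image that matches the negative caption."""
    result = client.images.generate(model="dall-e-3", prompt=negative_caption, n=1)
    return result.data[0].url
```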
**Experiments:**
- **Flickr30k-Positions:** Fine-tuning lifts both CLIP and LLaVA well above their near-random baselines on positional understanding.
- **Flickr30k-Counting:** Both models show clear gains in object counting after fine-tuning.
- **Flickr30k-Attributes:** CounterCurate-finetuned models outperform the baseline CLIP and LLaVA on average and across individual categories.
**Conclusion:**
CounterCurate effectively enhances the visio-linguistic compositional reasoning capabilities of multimodal models by addressing the neglect of physically grounded reasoning and leveraging advanced text and image generation models for semantic counterfactual fine-tuning.