12 Jun 2024
**CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples**
**Authors:** Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee
**Institutions:** University of Wisconsin–Madison; Microsoft Research
**Abstract:**
CounterCurate is a framework for improving the visio-linguistic compositional reasoning of both contrastive and generative multimodal models. It targets two critical but under-explored problems: the neglect of physically grounded reasoning (counting and positional understanding) and the untapped potential of highly capable text and image generation models for semantic counterfactual fine-tuning. Using simple data augmentation and the grounded image generation model GLIGEN to produce fine-tuning data, CounterCurate yields significant gains on benchmarks such as Flickr30k-Positions. It further leverages the high-performing text generation model GPT-4V and the image generation model DALL-E 3 to curate challenging semantic counterfactuals, boosting compositional reasoning on benchmarks like SugarCrepe. Experiments across multiple datasets show significant improvements over state-of-the-art models.
**Contributions:**
- Systematically studies physically grounded compositional reasoning and highlights the near-random performance of multimodal models on curated benchmarks.
- Significantly improves physical reasoning capabilities by generating counterfactual images and captions using data augmentation and grounded image inpainting.
- Enhances semantic compositional reasoning by curating challenging image-text pairs using advanced text and image generation models.
**Methods:**
- **Flickr30k-Positions:** Probes positional understanding with negative captions and counterfactual images for left-right and above-below relations; left-right negatives come from simple word swaps paired with horizontal image flips (see the first sketch after this list).
- **Flickr30k-Counting:** Targets object counting with captions whose stated counts are perturbed, plus matching counterfactual images produced by grounded inpainting with GLIGEN (see the two counting sketches after this list).
- **Flickr30k-Attributes:** Strengthens semantic compositional reasoning with hard negative captions written by GPT-4V and matching counterfactual images rendered by DALL-E 3 (see the final sketch after this list).
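A minimal sketch of the left-right counterfactual idea in Python. The swap table and function names are illustrative assumptions; the paper describes the general strategy (swap positional words in the caption, horizontally flip the image), not this exact code.

```python
from PIL import Image

# Illustrative word-swap table; the paper's exact rules may differ.
SWAP = {"left": "right", "right": "left"}

def negate_position_caption(caption: str) -> str:
    """Swap 'left' and 'right' so the caption contradicts the image."""
    return " ".join(SWAP.get(w.lower(), w) for w in caption.split())

def negate_position_image(path: str) -> Image.Image:
    """Mirror the image horizontally: the original caption becomes false,
    while the swapped caption becomes true for the flipped image."""
    return Image.open(path).transpose(Image.Transpose.FLIP_LEFT_RIGHT)

print(negate_position_caption("a dog sits to the left of a chair"))
# -> "a dog sits to the right of a chair"
```

The flip gives a genuine counterfactual image for free: no generation model is needed for the left-right case.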
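The counting counterfactuals follow the same pattern on the caption side: perturb the stated count so the caption no longer matches the image. A minimal sketch, with an illustrative number-word table that would need broader coverage in practice:

```python
import re

# Illustrative number-word table; real captions need wider coverage.
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
WORD_NUMS = {v: k for k, v in NUM_WORDS.items()}

def negate_count_caption(caption: str) -> str:
    """Increment the first number word so the stated count is wrong."""
    def bump(match: re.Match) -> str:
        n = NUM_WORDS[match.group(0).lower()]
        return WORD_NUMS.get(n + 1, str(n + 1))
    return re.sub(r"\b(one|two|three|four|five)\b", bump, caption,
                  count=1, flags=re.IGNORECASE)

print(negate_count_caption("Two dogs play in the park"))
# -> "three dogs play in the park"
```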
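On the image side, the paper uses GLIGEN's grounded inpainting to add or remove object instances so that the perturbed count becomes true. A hedged sketch using the Hugging Face diffusers GLIGEN pipeline; the checkpoint name, box coordinates, and file names are assumptions for illustration, not the paper's exact configuration:

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline
from PIL import Image

# Inpainting-flavored GLIGEN checkpoint (assumed name from diffusers docs).
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-inpainting-text-box", torch_dtype=torch.float16
).to("cuda")

source = Image.open("two_dogs.jpg").convert("RGB")  # hypothetical input
images = pipe(
    prompt="three dogs play in the park",
    gligen_inpaint_image=source,
    gligen_phrases=["a dog"],                 # what to paint into the box
    gligen_boxes=[[0.55, 0.40, 0.90, 0.85]],  # normalized xyxy box (assumed)
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images
images[0].save("three_dogs.jpg")
```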
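For the semantic counterfactuals, a hedged sketch of the curation loop with the OpenAI Python client (v1). The prompt wording and the "gpt-4o" stand-in model name are assumptions; the paper itself uses GPT-4V to write minimally edited negative captions and DALL-E 3 to render images that match them:

```python
from openai import OpenAI

client = OpenAI()

def hard_negative_caption(caption: str) -> str:
    """Ask the model for a minimally edited caption that contradicts the image."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name; the paper uses GPT-4V
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this caption so it describes a plausible but "
                f"different scene, changing as few words as possible: {caption!r}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def counterfactual_image_url(negative_caption: str) -> str:
    """Render an image that matches the negative caption."""
    result = client.images.generate(model="dall-e-3", prompt=negative_caption, n=1)
    return result.data[0].url
```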
**Experiments:**
- **Flickr30k-Positions:** Fine-tuning lifts both CLIP and LLaVA well above their near-random baselines on positional understanding.
- **Flickr30k-Counting:** Both models show clear gains in object counting after fine-tuning.
- **Flickr30k-Attributes:** CounterCurate-finetuned models outperform the baseline CLIP and LLaVA on average and across individual categories.
**Conclusion:**
CounterCurate effectively enhances the visio-linguistic compositional reasoning capabilities of multimodal models by addressing the neglect of physically grounded reasoning and leveraging advanced text and image generation models for semantic counterfactual fine-tuning.