12 Jun 2024 | Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee
CounterCurate is a framework designed to enhance the visio-linguistic compositional reasoning capabilities of both contrastive and generative multimodal models. The framework addresses two critical under-explored problems: the neglect of physically grounded reasoning (such as counting and position understanding) and the potential of using high-capability text and image generation models for semantic counterfactual fine-tuning.
The paper introduces a new benchmark, Flickr30k-Positions, which evaluates models' ability to understand positional relationships between objects, along with Flickr30k-Counting for object counting and Flickr30k-Attributes for semantic compositional reasoning. The framework generates counterfactual image-text pairs using data augmentation and grounded image generation models such as GLIGEN to create challenging examples for fine-tuning.
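As a minimal illustrative sketch (not the paper's actual pipeline), a positional counterfactual pair can be built by swapping "left" and "right" in a caption so that it no longer matches the image; the `swap_left_right` helper below is hypothetical, and in practice the image itself could also be mirrored horizontally to produce the matching counterfactual image.

```python
import re

def swap_left_right(caption: str) -> str:
    """Swap the words 'left' and 'right' to form a hard-negative caption.

    Illustrative only: a real pipeline would pair this with a horizontally
    flipped image (e.g. via PIL's Image.transpose) so each caption has a
    matching counterfactual image.
    """
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = "right" if word.lower() == "left" else "left"
        # Preserve the original capitalization.
        return swapped.capitalize() if word[0].isupper() else swapped

    # \b boundaries avoid touching words like 'lefty' or 'copyright'.
    return re.sub(r"\b(left|right)\b", repl, caption, flags=re.IGNORECASE)
```

For example, `swap_left_right("The dog is to the left of the cat")` returns `"The dog is to the right of the cat"`, a fluent caption that is wrong for the original image, which is exactly the kind of hard negative this style of curation targets.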
CounterCurate significantly improves the performance of models like CLIP and LLaVA on tasks involving physical and semantic reasoning. It outperforms existing methods on benchmarks such as SugarCrepe and demonstrates that using high-capability text and image generation models for counterfactual fine-tuning is effective. The framework also shows that incorporating both negative images and captions, along with grouping strategies, leads to significant improvements in model performance.
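One generic way to fold curated negative captions into contrastive fine-tuning is to append each image's hard-negative caption as an extra column in a CLIP-style InfoNCE loss. This is a hedged sketch of that idea, not the exact loss or grouping strategy used by CounterCurate; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negatives(img_emb: torch.Tensor,
                                    pos_txt_emb: torch.Tensor,
                                    neg_txt_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style loss where each image competes against in-batch captions
    plus its own curated hard-negative caption.

    All inputs are (B, D) L2-normalized embeddings; row i of neg_txt_emb is
    the hard negative for image i.
    """
    # (B, B): similarity of every image to every positive caption in the batch.
    logits_pos = img_emb @ pos_txt_emb.T / temperature
    # (B, 1): similarity of each image to its own hard-negative caption.
    logits_neg = (img_emb * neg_txt_emb).sum(dim=-1, keepdim=True) / temperature
    # (B, B+1): the correct caption for image i is still column i.
    logits = torch.cat([logits_pos, logits_neg], dim=1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```

The design point is that the hard negative enters the softmax denominator only, so the model is penalized whenever a counterfactual caption scores close to the true one.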
The paper evaluates the effectiveness of CounterCurate on various tasks, including positional understanding, object counting, and semantic compositional reasoning. It demonstrates that the framework significantly enhances the reasoning capabilities of both contrastive and generative models. The results show that CounterCurate improves the performance of models like CLIP and LLaVA, surpassing existing methods in several benchmarks. The framework is also shown to be effective in zero-shot vision-language tasks and does not negatively impact the performance of downstream tasks. The paper concludes that CounterCurate is a valid candidate for improving state-of-the-art large multimodal models.