Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective


3 Apr 2024 | Meiqi Chen, Yixin Cao, Yan Zhang, Chaochao Lu
This paper addresses the issue of unimodal biases in Multimodal Large Language Models (MLLMs), particularly in Visual Question Answering (VQA) tasks. The authors propose a causal framework to interpret and quantify these biases, using a causal graph to explain the predictions of MLLMs. They introduce the MORE dataset, which consists of 12,000 VQA instances designed to challenge MLLMs by requiring multi-hop reasoning and resistance to unimodal biases. The dataset comprises multiple-choice questions with distractors that target language bias, vision bias, and multi-hop reasoning. The authors also propose two strategies to mitigate unimodal biases: a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs, and fine-tuning open-source MLLMs with causal rationale guidance. Extensive experiments show that MLLMs perform poorly on MORE, highlighting their reliance on unimodal shortcuts, while the proposed strategies significantly improve performance, demonstrating the effectiveness of the causal approach in enhancing MLLMs' reasoning capabilities.
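To make the Decompose-Verify-Answer idea concrete, the sketch below shows one way a DeVA-style prompting loop could be wired up for a closed, limited-access MLLM. This is a minimal illustration, not the paper's implementation: the exact prompts, sub-question format, and model API are not given in the summary above, so the `ask_mllm` wrapper and all prompt wording here are assumptions.

```python
# Illustrative sketch of a Decompose-Verify-Answer (DeVA) style prompting loop.
# `ask_mllm` is a hypothetical wrapper around any chat-style multimodal LLM
# endpoint that accepts an image plus a text prompt and returns text.

from typing import List


def ask_mllm(image_path: str, prompt: str) -> str:
    """Hypothetical call to a multimodal LLM; replace with your API client."""
    raise NotImplementedError("Wire this to your MLLM provider.")


def deva_answer(image_path: str, question: str, options: List[str]) -> str:
    # 1) Decompose: break the multi-hop question into simpler sub-questions.
    decomposition = ask_mllm(
        image_path,
        "Break the question into the minimal sub-questions needed to answer it.\n"
        f"Question: {question}",
    )

    # 2) Verify: answer each sub-question strictly against the image, so the
    #    final answer is grounded in visual evidence rather than language priors.
    verified_facts = ask_mllm(
        image_path,
        "Answer each sub-question using only the image; say 'not visible' if the "
        f"image does not support an answer.\nSub-questions:\n{decomposition}",
    )

    # 3) Answer: choose among the multiple-choice options using only the
    #    verified facts, which is intended to blunt unimodal shortcuts.
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    final = ask_mllm(
        image_path,
        f"Using only these verified facts:\n{verified_facts}\n\n"
        f"Question: {question}\nOptions:\n{choices}\n"
        "Reply with the single letter of the best option.",
    )
    return final.strip()
```

In this sketch the verification step is what distinguishes DeVA from plain chain-of-thought prompting: each intermediate claim must be re-checked against the image before it can feed the final multiple-choice decision.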