Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

3 Apr 2024 | Meiqi Chen, Yixin Cao, Yan Zhang, Chaochao Lu
This paper presents a causal framework for quantifying and mitigating unimodal biases in Multimodal Large Language Models (MLLMs), focusing on Visual Question Answering (VQA). The authors build a causal graph of the VQA process to analyze how MLLMs fall back on language-only or vision-only shortcuts, which leads to incorrect answers on complex tasks.

They introduce MORE, a dataset of 12,000 VQA instances designed to challenge MLLMs by requiring multi-hop reasoning across both modalities. Each instance is a multiple-choice question whose distractor options specifically target language bias, vision bias, and multi-hop reasoning failures, and each comes with a causal rationale for interpretability.

To mitigate these biases, the authors propose two strategies: a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs, and fine-tuning open-source MLLMs on MORE. DeVA guides a model to break a complex question into sub-questions, verify intermediate answers against the image, and avoid spurious reasoning paths (a sketch of such a loop follows below). Fine-tuning on MORE improves the reasoning capabilities of open-source models such as LLaVA.

Experiments on six leading MLLMs show that most perform poorly on MORE, indicating a strong reliance on unimodal biases and highlighting the difficulty of genuine multimodal reasoning. Both the DeVA framework and the fine-tuned models significantly improve performance, particularly in overcoming language and vision biases. MORE thus serves as a comprehensive benchmark, underscoring the need for robust reasoning and bias mitigation in multimodal systems.
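To make the Decompose-Verify-Answer idea concrete, here is a minimal Python sketch of how such a loop could be wrapped around a black-box MLLM. This is not the authors' implementation: the `MoreInstance` fields, the `ask` callable, the prompt wording, and the retry logic are all illustrative assumptions based on the description above.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record mirroring the described MORE format: a multiple-choice
# VQA item whose options include distractors targeting language bias, vision
# bias, and multi-hop reasoning, plus a causal rationale for interpretability.
@dataclass
class MoreInstance:
    image_path: str
    question: str
    options: List[str]   # correct answer plus bias-targeting distractors
    rationale: str       # causal rationale (used for fine-tuning / analysis)

# `ask` stands in for any MLLM call: (image_path, prompt) -> text response.
AskFn = Callable[[str, str], str]

def decompose_verify_answer(inst: MoreInstance, ask: AskFn, max_rounds: int = 2) -> str:
    """Sketch of a Decompose-Verify-Answer loop for a limited-access MLLM."""
    # 1) Decompose: break the multi-hop question into simpler sub-questions.
    raw = ask(inst.image_path,
              f"Break this question into simple sub-questions, one per line:\n{inst.question}")
    sub_questions = [q.strip() for q in raw.splitlines() if q.strip()]

    # 2) Answer each sub-question, grounding every step in the image.
    steps = [(sq, ask(inst.image_path, f"Answer from the image only: {sq}"))
             for sq in sub_questions]

    candidate = ""
    for _ in range(max_rounds):
        # 3) Compose a candidate answer from the sub-answers.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in steps)
        candidate = ask(inst.image_path,
                        f"{context}\nUsing only the facts above, choose one option for "
                        f"'{inst.question}': {inst.options}")
        # 4) Verify: check the candidate against both modalities, flagging
        #    answers that rely on the text alone or the image alone.
        verdict = ask(inst.image_path,
                      f"Is '{candidate}' supported by BOTH the image and the question "
                      f"'{inst.question}'? Reply 'yes' or 'no' with a reason.")
        if verdict.lower().startswith("yes"):
            return candidate
    return candidate  # fall back to the last candidate if verification keeps failing
```

In this sketch the verification step is what discourages spurious unimodal paths: an answer is only accepted once the model attests that it is supported by both the question and the image, otherwise the compose-verify round repeats.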