17 Jun 2024 | Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen
MDPO is a multimodal preference optimization method designed to address the issue of unconditional preference in multimodal large language models (LLMs), where preference learning ignores the image and relies on language cues alone. The method adds two objectives to standard DPO: conditional preference optimization and anchored preference optimization. Conditional preference optimization ensures that the model learns preferences conditioned on the visual input, while anchored preference optimization prevents the likelihood of preferred responses from decreasing. Together, these objectives improve the model's ability to understand and respond to multimodal inputs, reducing hallucinations and improving overall performance.
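As a rough illustration of how these pieces could fit together, here is a minimal PyTorch-style sketch of a combined objective: standard DPO plus the two added terms. The hyperparameters (beta, the anchor value), the equal weighting of the terms, and the use of a degraded image for the conditional term are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def dpo_term(logratio_preferred, logratio_dispreferred, beta=0.1):
    # Standard DPO preference term: push the preferred item above the dispreferred one.
    return -F.logsigmoid(beta * (logratio_preferred - logratio_dispreferred))

def mdpo_loss(logratio_chosen, logratio_rejected, logratio_chosen_corrupt_img,
              beta=0.1, anchor=0.0):
    """Inputs are per-example log(pi_theta / pi_ref) tensors:
    - logratio_chosen / logratio_rejected: chosen / rejected response, original image
    - logratio_chosen_corrupt_img: chosen response, less informative image (assumption)
    """
    # (1) Standard language-side preference (DPO).
    l_dpo = dpo_term(logratio_chosen, logratio_rejected, beta)
    # (2) Conditional preference: the same response should be more likely given the
    #     real image than a degraded one, forcing the model to actually use the image.
    l_cond = dpo_term(logratio_chosen, logratio_chosen_corrupt_img, beta)
    # (3) Anchored preference: keep the chosen response's implicit reward above a
    #     fixed anchor so its likelihood does not drift downward during training.
    l_anchor = -F.logsigmoid(beta * logratio_chosen - anchor)
    return (l_dpo + l_cond + l_anchor).mean()

# Example usage with dummy log-ratios for a batch of two examples:
loss = mdpo_loss(torch.tensor([0.4, 0.1]), torch.tensor([-0.2, -0.1]),
                 torch.tensor([-0.3, -0.4]))
```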
The paper identifies that the failure of direct preference optimization (DPO) in multimodal scenarios is not solely a data-quality problem: there is a systematic gap between what the objective is assumed to optimize and what it optimizes in practice. DPO may latch onto language-only preferences and overlook the image condition entirely, leading to suboptimal performance and increased hallucination. MDPO addresses this with conditional preference optimization, which forces the model to take the visual input into account when making preference decisions.
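One plausible way to build such an image-conditioned preference pair, sketched below, is to contrast the original image with a deliberately degraded version of itself (here, a small random crop) while keeping the chosen response fixed; the specific corruption shown is an assumption, and the paper may construct the pair differently.

```python
import random
from PIL import Image

def corrupt_image(img: Image.Image, keep_frac: float = 0.2) -> Image.Image:
    """Return a crop that keeps only a small fraction of the original image,
    so the response can no longer be fully grounded in the visual input."""
    w, h = img.size
    cw, ch = max(1, int(w * keep_frac)), max(1, int(h * keep_frac))
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    return img.crop((left, top, left + cw, top + ch))

# During preference optimization, (original image, chosen response) is treated as
# preferred over (corrupted image, chosen response); the only way for the model to
# tell the pair apart is to actually attend to the image.
```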
Experiments on two multimodal LLMs, Bunny-v1.0-3B and LLaVA-v1.5-7B, demonstrate that MDPO significantly improves model performance, particularly in reducing hallucinations. The method outperforms standard DPO across various benchmarks, including MMHalBench, Object HalBench, and AMBER. Detailed analysis shows that conditional preference optimization plays a crucial role in enhancing the effectiveness of DPO for multimodal LLMs. Fine-grained and qualitative studies further illustrate that MDPO significantly improves the model's ability to comprehend images and mitigates language biases in model responses.
MDPO is effective across different model and data scales; with Bunny-v1.0-3B it yields the best-performing 3B multimodal LLM in terms of hallucination reduction. The method introduces an anchor to regularize the reward, ensuring that the probability of the chosen response does not decrease during training. This helps the model learn from multimodal preference data and leads to better alignment with human preferences. Overall, MDPO consistently enhances multimodal LLM performance and reduces hallucination across model sizes on the three benchmarks.
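To see why the anchor matters, the small numeric example below (values are made up) shows that the plain DPO loss depends only on the chosen-minus-rejected margin, so it can keep improving even while the chosen response becomes less likely relative to the reference model; the anchored term penalizes exactly that drop.

```python
import torch
import torch.nn.functional as F

beta, anchor = 0.1, 0.0

def dpo_loss(margin):
    # Standard DPO depends only on the chosen-minus-rejected margin.
    return -F.logsigmoid(torch.tensor(beta * margin)).item()

def anchored_loss(logratio_chosen):
    # The anchored term depends on the chosen response's log-ratio alone.
    return -F.logsigmoid(torch.tensor(beta * logratio_chosen - anchor)).item()

# Step A: chosen log-ratio 0.5, rejected 0.0   -> margin 0.5
# Step B: chosen log-ratio -1.0, rejected -2.0 -> margin 1.0, but the chosen
#         response has become *less* likely relative to the reference model.
for lr_chosen, lr_rejected in [(0.5, 0.0), (-1.0, -2.0)]:
    margin = lr_chosen - lr_rejected
    print(f"DPO: {dpo_loss(margin):.3f}   anchored: {anchored_loss(lr_chosen):.3f}")
# The DPO loss drops from step A to step B (~0.67 -> ~0.64) even though the chosen
# likelihood fell; the anchored term rises (~0.67 -> ~0.74) and pushes back.
```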