Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

18 Feb 2024 | Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
This paper introduces POVID, an approach that aligns the image and text modalities of Vision Large Language Models (VLLMs) through preference fine-tuning on automatically generated dispreferred responses. VLLMs, which pair strong pre-trained vision models with large language models (LLMs), often hallucinate, producing responses that do not accurately reflect the input image.

POVID addresses this by constructing dispreferred data without human feedback: an AI model such as GPT-4V injects plausible hallucinations into correct answers, and the input images are distorted to trigger the VLLM's inherent hallucination patterns. The resulting preference pairs are then integrated into an RLHF-style pipeline via Direct Preference Optimization (DPO), teaching the model to ground its responses in the image. Because the dispreferred data is AI-generated, the method scales without human annotation.

Empirical results across benchmarks show that POVID both reduces hallucinations and improves overall performance, outperforming prior preference-tuning approaches on tasks such as image captioning and visual question answering. Attention analyses further show that POVID redirects the model's attention toward the image modality, indicating improved modality alignment. The approach proceeds in two stages: generating dispreferred responses through hallucinated text and image distortion, and integrating these pairs into the DPO framework for fine-tuning; a sketch of the data-generation stage is given below.
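To make the data-generation stage concrete, the sketch below illustrates both halves of it under stated assumptions: HALLUCINATION_PROMPT is a hypothetical prompt for asking an AI annotator such as GPT-4V to inject plausible hallucinations into a correct answer, and distort_image applies additive Gaussian noise as one plausible form of image distortion. POVID's exact prompts and distortion scheme are not specified in this summary, so treat this as an illustration rather than the paper's implementation.

```python
import torch

# Text side of stage one (hypothetical prompt wording): an AI annotator such as
# GPT-4V rewrites a correct answer so it contains plausible hallucinations.
HALLUCINATION_PROMPT = (
    "Given the image and the correct answer below, rewrite the answer so that it stays "
    "fluent but introduces plausible errors, such as objects, attributes, or relations "
    "that are not actually present in the image.\n\nCorrect answer: {answer}"
)

def distort_image(image: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """Image side of stage one: corrupt the input (here with additive Gaussian noise,
    an assumed form of distortion) so the VLLM falls back on its language prior and
    produces a hallucinated response that can serve as the dispreferred answer."""
    noise = torch.randn_like(image) * noise_scale
    return (image + noise).clamp(0.0, 1.0)

# Example with a dummy 3x224x224 image whose pixel values lie in [0, 1].
clean = torch.rand(3, 224, 224)
noisy = distort_image(clean)
print(noisy.shape)
```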
The results demonstrate that POVID significantly improves performance compared to other preference tuning methods, achieving an average improvement of 12.4% across benchmarks.
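For context on the fine-tuning stage behind these results, the following is a minimal sketch of the standard DPO objective into which the preferred/dispreferred pairs feed. It assumes per-sequence log-probabilities from the policy and a frozen reference model and an illustrative beta value; this is the vanilla DPO loss, not POVID's exact variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the margin by which the policy prefers the
    correct (chosen) response over the AI-generated dispreferred (rejected) one,
    measured relative to a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up sequence log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -10.2]),
                torch.tensor([-14.0, -13.1, -12.5, -15.0]),
                torch.tensor([-12.5, -9.8, -11.2, -10.5]),
                torch.tensor([-13.0, -12.0, -12.0, -14.2]))
print(loss.item())
```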