Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

18 Feb 2024 | Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
This paper introduces POVID, an approach that aligns the image and text modalities of Vision Large Language Models (VLLMs) through preference fine-tuning on automatically generated dispreferred responses. VLLMs, which pair strong pre-trained vision models with large language models (LLMs), often hallucinate, producing responses that do not accurately reflect the input image.

POVID addresses this by constructing dispreferred data without human feedback: an AI model such as GPT-4V injects plausible hallucinations into correct answers, and the input images are distorted to trigger the VLLM's inherent hallucination patterns. The resulting preference pairs are then integrated into an RLHF-style pipeline via Direct Preference Optimization (DPO), teaching the model to ground its responses in the image. Because the dispreferred data is AI-generated, the method scales without human annotation.

Empirical results across benchmarks show that POVID both reduces hallucinations and improves overall performance, outperforming prior preference-tuning approaches on tasks such as image captioning and visual question answering. Attention analyses further show that POVID redirects the model's attention toward the image modality, indicating improved modality alignment. The approach proceeds in two stages: generating dispreferred responses through hallucinated text and image distortion, and integrating these pairs into the DPO framework for fine-tuning; a sketch of the data-generation stage is given below.
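To make the data-generation stage concrete, the sketch below illustrates both halves of it under stated assumptions: HALLUCINATION_PROMPT is a hypothetical prompt for asking an AI annotator such as GPT-4V to inject plausible hallucinations into a correct answer, and distort_image applies additive Gaussian noise as one plausible form of image distortion. POVID's exact prompts and distortion scheme are not specified in this summary, so treat this as an illustration rather than the paper's implementation.

```python
import torch

# Text side of stage one (hypothetical prompt wording): an AI annotator such as
# GPT-4V rewrites a correct answer so it contains plausible hallucinations.
HALLUCINATION_PROMPT = (
    "Given the image and the correct answer below, rewrite the answer so that it stays "
    "fluent but introduces plausible errors, such as objects, attributes, or relations "
    "that are not actually present in the image.\n\nCorrect answer: {answer}"
)

def distort_image(image: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """Image side of stage one: corrupt the input (here with additive Gaussian noise,
    an assumed form of distortion) so the VLLM falls back on its language prior and
    produces a hallucinated response that can serve as the dispreferred answer."""
    noise = torch.randn_like(image) * noise_scale
    return (image + noise).clamp(0.0, 1.0)

# Example with a dummy 3x224x224 image whose pixel values lie in [0, 1].
clean = torch.rand(3, 224, 224)
noisy = distort_image(clean)
print(noisy.shape)
```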
The results demonstrate that POVID significantly improves performance compared to other preference tuning methods, achieving an average improvement of 12.4% across benchmarks.
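For context on the fine-tuning stage behind these results, the following is a minimal sketch of the standard DPO objective into which the preferred/dispreferred pairs feed. It assumes per-sequence log-probabilities from the policy and a frozen reference model and an illustrative beta value; this is the vanilla DPO loss, not POVID's exact variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the margin by which the policy prefers the
    correct (chosen) response over the AI-generated dispreferred (rejected) one,
    measured relative to a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up sequence log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -10.2]),
                torch.tensor([-14.0, -13.1, -12.5, -15.0]),
                torch.tensor([-12.5, -9.8, -11.2, -10.5]),
                torch.tensor([-13.0, -12.0, -12.0, -14.2]))
print(loss.item())
```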