Multi-modal preference alignment remedies regression of visual instruction tuning on language model

February 19, 2024 | Shengzhi Li, Rongyu Lin, Shichao Pei
This paper addresses the degradation of language instruction-following capabilities in multi-modal large language models (MLLMs) after visual instruction tuning, and investigates distillation-based preference alignment methods to mitigate the issue. A lightweight VQA preference dataset of 6,000 entries was collected and annotated by Gemini across five quality metrics. The study evaluates several alignment methods, including Direct Preference Optimization (DPO), SteerLM, and Rejection Sampling, to enhance instruction-following capabilities and reduce modality conflict in MLLMs. DPO outperforms the other methods in restoring and improving language instruction-following, reaching 6.73 on MT-Bench compared with Vicuna's 6.57 and LLaVA's 5.99. The improvement carries over to visual instruction performance, with a +4.9% gain on MM-Vet and +6% on LLaVA-Bench, and DPO incurs minimal alignment tax on visual knowledge benchmarks compared to previous RLHF approaches.
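
To make the central method concrete, the following is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) that such a pipeline would optimize over distilled preference pairs; the tensor names and the beta value are illustrative and not taken from the paper's implementation.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: scaled log-ratio of the policy model to a frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss: widen the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Each log-probability tensor here is the summed token log-probability of a full response under the corresponding model; the frozen reference is typically the supervised fine-tuned checkpoint (in this setting, LLaVA before alignment).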
The study proposes a distillation-based multi-modal alignment approach that uses fine-grained annotations on a small dataset to reconcile the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning. The paper also discusses the challenges of modality conflict in MLLMs and the limitations of existing preference datasets, emphasizing the need for scalable and objective preference data collection. Evaluation spans visual instruction, visual multi-choice, and language instruction-following benchmarks. DPO significantly outperforms baseline models on open-ended visual instruction tasks and improves performance on multi-modal benchmarks; it slightly lowers LLaVA's MM-Bench score, but by less than LLaVA-RLHF does, indicating a less significant alignment tax. The paper concludes that DPO is an effective and data-efficient method for enhancing MLLM performance with minimal impact on existing knowledge, while highlighting the importance of addressing subjectivity and bias in preference datasets. Overall, the study offers insight into modality conflict in MLLMs and the potential of preference alignment as a remedy.
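
As a rough illustration of how fine-grained annotations such as the Gemini quality scores could be distilled into pairwise preference data for DPO, the sketch below averages per-response metric scores and keeps only pairs with a clear margin. The metric names, aggregation rule, and margin threshold are assumptions for illustration, not the paper's exact recipe.

def build_preference_pair(responses, metrics, min_margin=0.5):
    # responses: candidate answers to one prompt, each a dict with "prompt",
    # "text", and a per-metric "scores" dict produced by the annotator (e.g. Gemini)
    def aggregate(r):
        return sum(r["scores"][m] for m in metrics) / len(metrics)
    ranked = sorted(responses, key=aggregate, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if aggregate(best) - aggregate(worst) < min_margin:
        return None  # skip near-ties, which carry little preference signal
    return {"prompt": best["prompt"], "chosen": best["text"], "rejected": worst["text"]}

A call might pass metrics such as ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]; these names follow SteerLM-style attributes and are placeholders rather than the paper's confirmed metric set.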