February 19, 2024 | Shengzhi Li, Rongyu Lin, Shichao Pei
This paper addresses the performance degradation of multi-modal large language models (MLLMs) after visual instruction tuning, particularly on language instruction-following benchmarks. The authors collect a lightweight VQA preference dataset annotated by Gemini for five quality metrics and evaluate three alignment methods: Direct Preference Optimization (DPO), SteerLM, and Rejection Sampling. The results show that DPO significantly improves the model's language capabilities, achieving a score of 6.73 on MT-Bench, surpassing Vicuna's 6.57 and LLaVA's 5.99. DPO also enhances visual instruction performance by 4.9% on MM-Vet and 6% on LLaVA-Bench, with minimal alignment tax on visual knowledge benchmarks. The study proposes a distillation-based multi-modal alignment model that reconciles textual and visual performance, restoring and boosting language capability after visual instruction tuning. The main contributions include exploring modality degradation, proposing an innovative preference alignment methodology, and developing an efficient data annotation scheme. The paper discusses the challenges of modality conflict in MLLMs and the limitations of existing alignment methods, highlighting the effectiveness of DPO in addressing these issues.
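For context on the alignment method the paper favors, below is a minimal sketch of the standard DPO objective (as introduced by Rafailov et al.), not the authors' implementation; the function name, the `beta` value, and the example log-probability inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective (sketch): push the policy to prefer the chosen
    response over the rejected one, measured relative to a frozen reference model."""
    # Implicit rewards: how much more (or less) likely each response is under
    # the policy than under the reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage: sequence-level log-probs of the chosen and rejected
# responses under the current policy and the frozen reference model.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
```

In the paper's setting, the chosen/rejected pairs come from the Gemini-annotated VQA preference data rather than human labels, which is what keeps the annotation scheme lightweight.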