21 Aug 2024 | Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang
This paper introduces SeVa (Self-supervised Visual Preference Alignment), a novel approach to unsupervised preference alignment in Vision-Language Models (VLMs). SeVa generates chosen and rejected responses from original and augmented versions of the same image, then applies direct preference optimization (DPO) to align the model with user intentions. The core idea is that proper image augmentations induce VLMs to produce false but hard negative responses, and contrasting these against responses to the original image improves the model's robustness and answer accuracy. The pipeline is highly efficient, requiring minimal code and no supervision from GPT-4 or human annotators. With 8k randomly sampled unsupervised data, SeVa achieves a 90% relative score to GPT-4 on complex reasoning tasks in LLaVA-Bench and improves LLaVA-7B/13B by 6.7%/5.6% on the MM-Vet benchmark. Visualizations and ablations show improved alignment with user intentions and enhanced capabilities such as stronger chain-of-thought reasoning, better OCR ability, and fewer hallucinations. The method is simple to implement and shows potential for further scaling.
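To make the pipeline concrete, the sketch below illustrates the general idea in Python: the same question is answered on a clean image (treated as the chosen response) and on an augmented copy (treated as the rejected response), and the resulting pair is scored with a standard DPO objective against a frozen reference model. This is a minimal, hypothetical illustration, not the authors' code: the `vlm` wrapper with `generate()`/`log_prob()` methods, the specific augmentation, and the `beta` value are all assumptions.

```python
# Minimal sketch of a SeVa-style self-supervised preference pipeline.
# The `vlm`/`policy`/`ref` objects and their generate()/log_prob() methods
# are hypothetical placeholders, not an actual SeVa or LLaVA API.
import torch
import torch.nn.functional as F
from torchvision import transforms

# An image corruption intended to induce false-but-plausible ("hard negative")
# answers; the particular augmentation choice here is an assumption.
augment = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.3, 0.7)),
    transforms.ColorJitter(brightness=0.5, contrast=0.5),
])

def build_preference_pair(vlm, image, question):
    """Chosen = answer on the clean image; rejected = answer on the augmented copy."""
    chosen = vlm.generate(image, question)              # response to the original image
    rejected = vlm.generate(augment(image), question)   # hard-negative response
    return chosen, rejected

def dpo_loss(policy, ref, image, question, chosen, rejected, beta=0.1):
    """Standard DPO objective on the self-generated pair (trainable policy vs. frozen reference)."""
    logp_c = policy.log_prob(image, question, chosen)
    logp_r = policy.log_prob(image, question, rejected)
    with torch.no_grad():
        ref_c = ref.log_prob(image, question, chosen)
        ref_r = ref.log_prob(image, question, rejected)
    margin = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    return -F.logsigmoid(margin).mean()
```

Because both preference responses come from the model itself, the whole loop needs only unlabeled images and questions, which is what allows the method to skip GPT-4 or human preference annotation.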