Self-Supervised Visual Preference Alignment

21 Aug 2024 | Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang
This paper introduces SeVa, a self-supervised visual preference alignment method for improving multi-modal comprehension in Vision-Language Models (VLMs). The method constructs chosen and rejected responses for images and aligns the model on these pairs through direct preference optimization (DPO), leveraging image augmentation to induce false but hard negative responses from which the model learns to produce more robust answers. The pipeline requires no supervision from GPT-4 and no human involvement, and is highly efficient, needing only minimal code. Using only 8k randomly sampled unsupervised data, SeVa achieves a 90% relative score against GPT-4 on complex reasoning in LLaVA-Bench and improves LLaVA-7B/13B by 6.7%/5.6% on MM-Vet. Visualizations show improved alignment with user intentions, and a series of ablations reveals the method's potential for further scaling. The code is available at https://github.com/Kevinz-code/SeVa.

The paper first discusses the limitations of current VLMs in aligning with user intentions and the difficulty of collecting data for preference alignment. It then proposes SeVa, which generates preference data in a self-supervised manner, without relying on GPT-4 or human feedback. The core idea is that properly designed image augmentation can induce responses that are false yet serve as hard negatives; treating these as rejected answers gives the model informative contrasts to learn from, leading to more robust outputs. The resulting pipeline is simple, efficient, and effective, achieving significant improvements in multi-modal comprehension; a minimal sketch of this preference-pair construction and the DPO objective follows below.
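The sketch below is illustrative rather than the authors' released implementation. It assumes that responses generated from the original image serve as chosen answers and responses generated from a heavily augmented copy of the same image serve as rejected answers, and that the VLM exposes hypothetical helpers generate_response and sequence_logprob for generation and per-sequence log-probability scoring.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of SeVa-style preference-pair construction and the DPO loss.
# Assumptions (not taken from the paper's code): `policy` and `reference` are
# vision-language models exposing the hypothetical helpers
# `generate_response(image, question)` and
# `sequence_logprob(image, question, answer)`; `augment` is an image
# augmentation strong enough to corrupt the generated answer.

def build_preference_pair(policy, image, question, augment):
    """Chosen = answer to the original image; rejected = answer to the augmented image."""
    chosen = policy.generate_response(image, question)
    rejected = policy.generate_response(augment(image), question)
    return chosen, rejected

def dpo_loss(policy, reference, image, question, chosen, rejected, beta=0.1):
    """Standard DPO objective: prefer `chosen` over `rejected` relative to a frozen reference model."""
    pi_w = policy.sequence_logprob(image, question, chosen)    # log pi_theta(y_w | x)
    pi_l = policy.sequence_logprob(image, question, rejected)  # log pi_theta(y_l | x)
    with torch.no_grad():  # the reference model stays frozen
        ref_w = reference.sequence_logprob(image, question, chosen)
        ref_l = reference.sequence_logprob(image, question, rejected)
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin)
```

Note that in this sketch both answers are scored conditioned on the original image; the augmentation is only used to elicit the hard-negative answer, so nothing beyond unlabeled images and questions is required.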
The paper also draws a connection between SeVa and contrastive learning, showing that the method can be viewed as a special form of contrastive learning with a single negative sample. Experiments on a broad set of benchmarks demonstrate the effectiveness of SeVa, with gains in multi-modal comprehension, reduced object hallucination on hallucination benchmarks, and better alignment with user intentions. The paper concludes that SeVa is a simple, efficient, and effective approach with the potential to substantially improve VLM performance in practical applications, and with room for further scaling and application across domains.
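To make this analogy concrete, here is a short worked identity (notation is mine, assuming the standard DPO implicit reward with chosen response y_w and rejected response y_l): the pairwise DPO loss is exactly a softmax cross-entropy over one positive and one negative, i.e. an InfoNCE-style contrastive loss with a single negative.

```latex
% DPO rewritten as contrastive learning with one negative sample:
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
  = -\log \frac{\exp\bigl(r(x, y_w)\bigr)}
               {\exp\bigl(r(x, y_w)\bigr) + \exp\bigl(r(x, y_l)\bigr)},
\qquad
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
```

Replacing the single rejected response y_l with a set of negatives recovers the usual multi-negative contrastive (InfoNCE) form, which is the sense in which the summary above describes SeVa as contrastive learning with one negative.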