8 Feb 2025 | Xiyao Wang, Jiuhan Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Taha Kass-Hout, Furong Huang, Cao Xiao
This paper proposes SIMA, a self-improvement framework for enhancing visual-language modality alignment in large vision-language models (LVLMs). Unlike existing methods that rely on external models or data, SIMA leverages the model's own capabilities to generate responses and employs an in-context self-critic mechanism to evaluate and refine them. The framework consists of three stages: response self-generation, in-context self-critic, and preference tuning. In the self-generation stage, the model samples candidate responses to prompts drawn from its own visual instruction tuning dataset. In the self-critic stage, the model evaluates these responses using a carefully designed critic prompt, which includes three visual critic metrics to guide the evaluation and yields preference pairs of chosen and rejected responses. Preference tuning then updates the model on these pairs. The key innovation of SIMA is its ability to self-critique without additional fine-tuning, which significantly improves the accuracy of the self-critic. Experimental results show that SIMA significantly improves the performance of LVLMs on 14 hallucination and comprehensive benchmarks, outperforming previous approaches. Applied to LLaVA-1.5 and VILA, the framework achieves notable improvements in hallucination reduction and comprehension capability. The paper also discusses the importance of the three visual critic metrics in the self-critic process and presents ablation studies evaluating SIMA's effectiveness. The results demonstrate that SIMA effectively enhances modality alignment and reduces hallucinations in LVLMs.
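The three-stage loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the stage names follow the summary, but every model call here is a hypothetical stand-in (the real system would invoke the LVLM itself for both generation and critique).

```python
# Hedged sketch of a SIMA-style self-improvement loop.
# All model functions below are toy stand-ins, not the paper's API.

def self_generate(model, image, question, n=4):
    # Stage 1: response self-generation — sample n candidate responses
    # to a prompt from the visual instruction tuning dataset.
    return [model(image, question, seed=s) for s in range(n)]

def in_context_self_critic(score_fn, image, question, responses):
    # Stage 2: in-context self-critic — the same model scores each
    # response under a critic prompt (guided by three visual critic
    # metrics in the paper); best becomes "chosen", worst "rejected".
    ranked = sorted(responses, key=lambda r: score_fn(image, question, r))
    return {"chosen": ranked[-1], "rejected": ranked[0]}

def build_preference_pairs(model, score_fn, dataset, n=4):
    # Stage 3 input: preference pairs that preference tuning
    # (e.g. a DPO-style update) would then train on.
    return [
        (image, question,
         in_context_self_critic(score_fn, image, question,
                                self_generate(model, image, question, n)))
        for image, question in dataset
    ]

# Toy stand-ins so the sketch runs end to end.
def toy_model(image, question, seed=0):
    return f"answer-{seed} about {image}"

def toy_score(image, question, response):
    # Hypothetical critic score: prefers lower-seed answers.
    return -int(response.split("-")[1].split()[0])

pairs = build_preference_pairs(toy_model, toy_score, [("img1", "what is shown?")])
print(pairs[0][2])  # {'chosen': 'answer-0 about img1', 'rejected': 'answer-3 about img1'}
```

The actual preference-tuning update is omitted; the point is that generation and critique both come from the same model, so no external annotator or reward model is needed to produce the training pairs.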