8 Feb 2025 | Xiyao Wang, Jiahai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Taha Kass-Hout, Furong Huang, Cao Xiao
The paper introduces SIMA (Self-Improvement Modality Alignment), a novel framework designed to enhance the alignment between visual and language modalities in Large Vision Language Models (LVLMs). SIMA aims to improve modality alignment without relying on external models or data, addressing the distribution shifts and high costs associated with traditional methods. The framework consists of three stages: response self-generation, in-context self-critic, and preference tuning. During response self-generation, the model generates diverse responses using prompts from the visual instruction tuning dataset. The in-context self-critic stage lets the model evaluate these responses and form preference pairs, and the preference tuning stage then updates the model on those pairs. Key innovations include the use of visual critic metrics (Accuracy in Object Description, Accuracy in Depicting Relationships, and Accuracy in Describing Attributes) to guide the evaluation process and the ability of the model to act as its own critic without additional fine-tuning. Extensive experiments on 14 hallucination and comprehensive benchmarks demonstrate that SIMA significantly improves the performance of LVLMs, reducing hallucinations and enhancing comprehension capabilities.
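To make the three-stage loop concrete, here is a minimal, self-contained sketch. The interfaces `lvlm_generate` and `lvlm_critic`, the critic prompt wording, and the specific temperatures are placeholders of mine, not the paper's code; the preference-tuning step is shown as a standard DPO loss, which is one common instantiation of preference tuning, and the paper's exact objective may differ.

```python
# Hypothetical sketch of the SIMA loop: generate -> self-critique -> preference-tune.
# `lvlm_generate` / `lvlm_critic` are stand-ins for a real LVLM, not the paper's API.
import math
import random

# Critic instruction built around the paper's three visual critic metrics.
CRITIC_PROMPT = (
    "Given the image and question, compare the two candidate responses on: "
    "(1) accuracy in object description, (2) accuracy in depicting relationships, "
    "(3) accuracy in describing attributes. Answer with the better response."
)

def lvlm_generate(image, prompt, temperature):
    """Placeholder for the LVLM's sampling-based decoder (stage 1)."""
    return f"response@T={temperature} to {prompt!r}"

def lvlm_critic(image, prompt, cand_a, cand_b):
    """Placeholder: the *same* model judges its own candidates in-context (stage 2)."""
    return random.choice([cand_a, cand_b])  # stand-in for the model's verdict

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss from (chosen, rejected) log-probs under policy and frozen reference."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Stage 1: response self-generation (diverse samples via temperature sweep).
image, prompt = "img_0001.jpg", "Describe the scene."
candidates = [lvlm_generate(image, prompt, t) for t in (0.2, 0.7, 1.0)]

# Stage 2: in-context self-critic -> one (chosen, rejected) preference pair.
chosen = lvlm_critic(image, prompt, candidates[0], candidates[1])
rejected = candidates[1] if chosen == candidates[0] else candidates[0]

# Stage 3: preference tuning; dummy log-probs stand in for real model scores.
loss = dpo_loss(pi_w=-1.2, pi_l=-2.5, ref_w=-1.4, ref_l=-2.3)
print(f"pair: ({chosen!r} > {rejected!r}), dpo loss = {loss:.4f}")
```

The property the sketch tries to surface is that the generator and the critic are the same model during pair construction, so, per the summary above, no external reward model, judge, or extra data is needed to produce the preference pairs.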