The paper introduces the Calibrated Self-Rewarding (CSR) approach to address the modality misalignment issue in Large Vision-Language Models (LVLMs). LVLMs often generate text responses that contradict the input image, a phenomenon known as hallucination. This issue arises because LVLMs tend to prioritize textual information over visual input, even when both are of high quality. Existing methods, such as preference fine-tuning, rely on additional models or human annotations, which are resource-intensive and may not effectively capture the target LVLM's preferences.
CSR proposes a novel approach that enables the model to self-improve by iteratively generating candidate responses, evaluating their rewards, and curating preference data for fine-tuning. The key innovation is the incorporation of visual constraints into the self-rewarding process so that the reward emphasizes visual input rather than textual priors. Reward modeling is performed step-wise: the initial rewards the model assigns to its own outputs are calibrated with visual relevance scores at each step.
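A minimal sketch of this idea is given below: each candidate response carries a text-side self-reward and a visual relevance score, the two are blended into a calibrated reward, and the best and worst candidates are kept as a preference pair for fine-tuning (e.g., with DPO). The weighting scheme, the `Candidate` fields, and the function names are illustrative assumptions rather than the paper's exact formulation, which may, for instance, compute relevance with a CLIP-style encoder and operate at the sentence level.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    text: str
    self_reward: float       # e.g., length-normalized log-likelihood from the LVLM itself
    visual_relevance: float  # e.g., image-text similarity from a vision encoder (assumed)

def calibrated_reward(c: Candidate, alpha: float = 0.5) -> float:
    """Blend the model's own text-side reward with a visual relevance score.

    `alpha` is a hypothetical weighting hyperparameter; the paper's exact
    calibration formula may differ.
    """
    return (1 - alpha) * c.self_reward + alpha * c.visual_relevance

def build_preference_pair(candidates: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Select the highest- and lowest-scoring candidates as the preferred /
    dispreferred responses for preference fine-tuning."""
    ranked = sorted(candidates, key=calibrated_reward, reverse=True)
    return ranked[0], ranked[-1]

# Toy usage with made-up scores for a single image prompt.
cands = [
    Candidate("A dog sits on a red couch.", self_reward=-0.8, visual_relevance=0.71),
    Candidate("A cat sleeps on a blue bed.", self_reward=-0.6, visual_relevance=0.22),
]
chosen, rejected = build_preference_pair(cands)
print(f"preferred: {chosen.text!r}  dispreferred: {rejected.text!r}")
```

Repeating this loop of candidate generation, calibrated scoring, and preference fine-tuning is what allows the model to improve iteratively without external annotators.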
Empirical results show that CSR significantly enhances performance and reduces hallucinations across various benchmarks and tasks, achieving a 7.62% improvement over existing methods. Theoretical analysis supports the effectiveness of CSR, demonstrating that introducing visual constraints into the self-rewarding paradigm can improve performance under mild assumptions. CSR is also shown to be compatible with different vision-language models and can continuously improve performance through iterative fine-tuning.