Calibrated Self-Rewarding Vision Language Models


31 May 2024 | Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao
Calibrated Self-Rewarding (CSR) is a novel approach for improving modality alignment in Large Vision-Language Models (LVLMs). The method enables LVLMs to self-improve by iteratively generating candidate responses, evaluating their rewards, and curating preference data for fine-tuning. CSR integrates visual constraints into the self-rewarding process, emphasizing the visual input and reducing hallucinations. It uses a step-wise reward strategy that combines a self-generated instruction-following score with an image-response relevance score to calibrate the reward; this calibration encourages the LVLM to focus more on visual information, leading to better alignment between the image and text modalities.

Empirical results show that CSR significantly improves performance across various benchmarks and tasks, achieving a 7.62% improvement over existing methods. The method is compatible with different LVLMs and can incrementally improve performance through iterative fine-tuning. Theoretical analysis supports the effectiveness of CSR, showing that incorporating visual constraints into the self-rewarding paradigm enhances performance under mild assumptions. CSR also aligns the image and text modalities by adjusting attention weights and preference pairs, leading to more accurate and factually consistent responses.

CSR is implemented with LLaVA-1.5 as the backbone model and has been evaluated on multiple benchmarks, including comprehensive benchmarks, general VQA tasks, and hallucination benchmarks, where it outperforms other preference fine-tuning baselines. The approach is supported by extensive experiments and theoretical analysis, demonstrating its effectiveness in enhancing LVLM performance and reducing hallucinations.
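The calibrated step-wise reward can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`sentence_logprob`, `clip_relevance`, `calibrated_reward`), the weighting factor `alpha`, and the use of a CLIP-style encoder for the image-relevance term are assumptions made for exposition.

```python
import torch.nn.functional as F

# Minimal sketch of a calibrated step-wise reward in the spirit of CSR.
# `lvlm` and `clip_model` stand in for an LVLM and a CLIP-style encoder;
# their method names here are illustrative, not a real library API.

def sentence_logprob(lvlm, image, prompt, sentence):
    """Length-normalized log-probability the LVLM assigns to one sentence
    of its own response (the self-generated instruction-following score)."""
    token_logprobs = lvlm.score_tokens(image=image, prompt=prompt, text=sentence)
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def clip_relevance(clip_model, image, sentence):
    """Image-response relevance score: cosine similarity between image and
    sentence embeddings, mapped to [0, 1]."""
    img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
    txt_emb = F.normalize(clip_model.encode_text(sentence), dim=-1)
    return 0.5 * (1.0 + (img_emb * txt_emb).sum().item())

def calibrated_reward(lvlm, clip_model, image, prompt, response, alpha=0.5):
    """Step-wise calibrated reward: score each sentence of the response by
    combining the model's own confidence with the visual relevance term,
    then average over sentences. `alpha` is an assumed mixing weight."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    step_rewards = []
    for sentence in sentences:
        r_text = sentence_logprob(lvlm, image, prompt, sentence)
        r_visual = clip_relevance(clip_model, image, sentence)
        step_rewards.append((1 - alpha) * r_text + alpha * r_visual)
    return sum(step_rewards) / max(len(step_rewards), 1)
```

In the self-improvement loop described above, several candidate responses would be sampled per image-prompt pair, scored with such a calibrated reward, and the resulting preference data curated for the next round of fine-tuning.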