Are We on the Right Way for Evaluating Large Vision-Language Models?

9 Apr 2024 | Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao
The paper identifies two critical issues in current evaluations of large vision-language models (LVLMs): 1) many evaluation samples do not actually require visual content, since answers can be derived from the world knowledge embedded in the underlying LLM or from the question text itself; and 2) unintentional data leakage occurs during LLM and LVLM training, so models can correctly answer visually dependent questions without ever seeing the image. Both issues lead to inaccurate assessments of LVLMs' multi-modal capabilities.

To address them, the authors introduce MMStar, a new benchmark of 1,500 samples carefully selected to ensure visual dependency and minimal data leakage. MMStar evaluates six core capabilities across 18 detailed axes, providing a balanced and purified sample set for assessing multi-modal performance. The authors also propose two metrics, multi-modal gain (MG) and multi-modal leakage (ML), to measure the actual performance gain contributed by visual inputs and the degree of data leakage during training.

Evaluating 16 leading LVLMs on MMStar, the authors find that GPT-4V achieves the highest accuracy, with strong MG and low ML. The results underscore the importance of ensuring visual dependency and controlling data leakage in evaluation samples, and the study calls for more rigorous and balanced benchmarks to fairly measure LVLMs' multi-modal performance.
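The summary above does not spell out how MG and ML are computed. The following minimal Python sketch illustrates one plausible reading, assuming three accuracy scores per model: the LVLM evaluated with images, the LVLM evaluated on text only, and its base LLM evaluated on text only; the function names and example numbers are illustrative, not taken from the paper's released code.

# Sketch of multi-modal gain (MG) and multi-modal leakage (ML), assuming:
#   score_with_visual    - LVLM accuracy with both image and question
#   score_without_visual - LVLM accuracy with the question only (image withheld)
#   score_llm_base       - accuracy of the LVLM's underlying LLM on text only
# Names and values below are hypothetical, for illustration only.

def multimodal_gain(score_with_visual: float, score_without_visual: float) -> float:
    """Performance actually attributable to the visual input."""
    return score_with_visual - score_without_visual

def multimodal_leakage(score_without_visual: float, score_llm_base: float) -> float:
    """How much the image-blind LVLM outperforms its base LLM,
    which suggests evaluation data leaked into training."""
    return max(0.0, score_without_visual - score_llm_base)

# Illustrative example with made-up scores:
print(multimodal_gain(57.0, 35.0))     # 22.0 -> genuine gain from visual inputs
print(multimodal_leakage(35.0, 30.0))  # 5.0  -> suspected leakage during training

Under this reading, a trustworthy LVLM shows a high MG and an ML near zero, which matches the paper's conclusion about GPT-4V.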