The paper "Are We on the Right Way for Evaluating Large Vision-Language Models?" by Lin Chen et al. addresses two primary issues in the evaluation of large vision-language models (LVLMs): the unnecessary reliance on visual content and unintentional data leakage during training. To address these issues, the authors introduce MMStar, a new multi-modal benchmark that includes 1,500 carefully curated samples, each rigorously validated by humans. MMStar aims to evaluate LVLMs' multi-modal capabilities with balanced and purified samples, covering six core capabilities and 18 detailed axes.
The authors find that many samples in current benchmarks can be answered without the visual content, either through text-based world knowledge or by inferring the answer directly from the question and options. This issue is prevalent across various benchmarks and leads to misjudging LVLMs' actual multi-modal gains. Additionally, they observe that both LLMs and LVLMs can answer some visual-necessary questions without visual input, indicating unintentional data leakage during training.
To address these issues, MMStar is designed so that each sample exhibits visual dependency, involves minimal data leakage, and requires advanced multi-modal capabilities to solve. On MMStar, the authors evaluate 16 leading LVLMs and propose two new metrics, multi-modal gain (MG) and multi-modal leakage (ML), to measure the actual performance gain from, and data leakage in, multi-modal training.
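A minimal sketch of how these two metrics could be computed from benchmark scores follows. The exact formulation is the paper's; the function names, arguments, and numbers here are illustrative assumptions based on the metrics' described intent (MG comparing an LVLM's scores with and without visual input, ML comparing the LVLM's text-only score against its underlying LLM base).

```python
def multimodal_gain(score_with_visual: float, score_without_visual: float) -> float:
    """MG: how much the LVLM actually benefits from seeing the image,
    assumed here to be the score difference with vs. without visual input."""
    return score_with_visual - score_without_visual


def multimodal_leakage(score_without_visual: float, llm_base_score: float) -> float:
    """ML: suspected leakage of evaluation samples into multi-modal training,
    assumed here to be the non-negative margin by which the LVLM without images
    outperforms its underlying LLM base evaluated text-only."""
    return max(0.0, score_without_visual - llm_base_score)


# Illustrative usage with made-up accuracy scores (percentages).
mg = multimodal_gain(score_with_visual=55.0, score_without_visual=32.0)
ml = multimodal_leakage(score_without_visual=32.0, llm_base_score=28.0)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")  # MG = 23.0, ML = 4.0
```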
The evaluation results show that even the best LVLMs score under 60 on average, highlighting the need for more effective multi-modal training strategies. The authors also analyze the MG and ML metrics across six popular benchmarks and MMStar, providing valuable insights for the community on gathering multi-modal training data and crafting new benchmarks.