2025-05-05 | Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
This paper introduces LMMs-Eval, a unified and standardized benchmark suite for evaluating large multimodal models (LMMs), covering more than 50 tasks and over 10 models, with the goal of making LMM evaluation transparent and reproducible. However, the authors note that simultaneously achieving wide coverage, low cost, and zero contamination is a challenging evaluation trilemma. To address it, they introduce LMMs-Eval LITE, a pruned benchmark set that maintains evaluation quality while reducing cost, and LIVEBENCH, a benchmark built from the latest news and online forum content that assesses models' zero-shot generalization to recent events at low cost.
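To build LITE, each full benchmark is replaced by a small subset chosen to preserve the ranking produced by the full evaluation. Below is a minimal sketch of one standard way to pick such a subset, k-center greedy coreset selection over item embeddings; the function name, the embedding source, and the budget of 500 are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Select `budget` indices whose embeddings cover the dataset:
    repeatedly add the point farthest from the current selection."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        next_idx = int(np.argmax(dists))  # farthest point = worst-covered
        selected.append(next_idx)
        new_d = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-center distances
    return selected

# Example: pick 500 representative items out of 5,000 (random embeddings as stand-ins)
emb = np.random.default_rng(1).normal(size=(5000, 768))
lite_indices = k_center_greedy(emb, budget=500)
```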
The authors also highlight data contamination in existing benchmarks, which can skew evaluation results, and propose detecting and reducing it by analyzing text and image overlaps between training data and benchmark data. LIVEBENCH complements this with dynamic evaluation: a pipeline continuously gathers the latest global information from webpages such as news sites and community forums, and the collected material is then processed into question-answer (QA) pairs for benchmarking, preventing contamination while keeping costs low.
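As a concrete illustration of the text side of such an overlap analysis, the sketch below computes word-level n-gram overlap between a benchmark item and a training document; the n-gram length and the simple set intersection are illustrative choices, and the paper's image-overlap check (comparing visual content rather than words) is omitted here.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams used as a cheap fingerprint of a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the
    training document; values near 1.0 suggest contamination."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)

# Example: an exact copy of a benchmark question inside a training corpus
q = "What is the capital of France and which river flows through it"
doc = "trivia dump: what is the capital of france and which river flows through it ..."
print(overlap_ratio(q, doc))  # 1.0 here, flagging a likely overlap
```

The collection-and-generation step of LIVEBENCH can be pictured as follows. This is a minimal text-only sketch: `llm_generate` is a hypothetical callable standing in for any text-generation API, and the real pipeline, per the paper's description, works from freshly crawled pages and applies quality checks before questions enter the benchmark.

```python
import requests
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    answer: str
    source_url: str

def fetch_page(url: str) -> str:
    """Download a news or forum page; returns raw HTML (a real pipeline
    would also extract readable text and capture page images)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def generate_qa(page_text: str, source_url: str, llm_generate) -> QAItem:
    """Ask a strong model to draft one QA pair grounded in the page.
    `llm_generate` is a placeholder for any text-generation call."""
    prompt = (
        "From the article below, write one factual question that can only be "
        "answered using the article, then its answer on a new line starting "
        "with 'ANSWER:'.\n\n" + page_text[:4000]
    )
    out = llm_generate(prompt)
    question, _, answer = out.partition("ANSWER:")
    return QAItem(question.strip(), answer.strip(), source_url)
```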
On LIVEBENCH, the GPT-4 series models, including GPT-4o-mini and GPT-4o, perform strongly, and proprietary models such as the Gemini and Claude series continue to outperform open-source models. The authors conclude that while the evaluation trilemma cannot be fully resolved, their work offers practical ways to navigate its trade-offs, paving the way for more effective and reliable benchmarking of LMMs.