2025-05-05 | Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
This paper introduces LMMs-Eval, a unified and standardized benchmark suite for evaluating large multimodal models (LMMs), covering more than 50 tasks and over 10 models, with the goal of making LMM evaluation transparent and reproducible. However, the authors note that simultaneously achieving wide coverage, low cost, and zero contamination is a challenging evaluation trilemma. To address it, they introduce LMMs-Eval LITE, a pruned benchmark set that maintains evaluation quality while reducing cost, and LIVEBENCH, a benchmark built from the latest news and online forum content that assesses models' zero-shot generalization to recent events at low cost.
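To build LITE, each full benchmark is replaced by a small subset chosen to preserve the ranking produced by the full evaluation. Below is a minimal sketch of one standard way to pick such a subset, k-center greedy coreset selection over item embeddings; the function name, the embedding source, and the budget of 500 are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Select `budget` indices whose embeddings cover the dataset:
    repeatedly add the point farthest from the current selection."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        next_idx = int(np.argmax(dists))  # farthest point = worst-covered
        selected.append(next_idx)
        new_d = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-center distances
    return selected

# Example: pick 500 representative items out of 5,000 (random embeddings as stand-ins)
emb = np.random.default_rng(1).normal(size=(5000, 768))
lite_indices = k_center_greedy(emb, budget=500)
```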
The authors also highlight data contamination in existing benchmarks, which can skew evaluation results, and propose detecting and reducing it by analyzing text and image overlaps between training data and benchmark data. LIVEBENCH complements this with dynamic evaluation: a pipeline continuously gathers the latest global information from webpages such as news sites and community forums, and the collected material is then processed into question-answer (QA) pairs for benchmarking, preventing contamination while keeping costs low.
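As a concrete illustration of the text side of such an overlap analysis, the sketch below computes word-level n-gram overlap between a benchmark item and a training document; the n-gram length and the simple set intersection are illustrative choices, and the paper's image-overlap check (comparing visual content rather than words) is omitted here.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams used as a cheap fingerprint of a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the
    training document; values near 1.0 suggest contamination."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)

# Example: an exact copy of a benchmark question inside a training corpus
q = "What is the capital of France and which river flows through it"
doc = "trivia dump: what is the capital of france and which river flows through it ..."
print(overlap_ratio(q, doc))  # 1.0 here, flagging a likely overlap
```

The collection-and-generation step of LIVEBENCH can be pictured as follows. This is a minimal text-only sketch: `llm_generate` is a hypothetical callable standing in for any text-generation API, and the real pipeline, per the paper's description, works from freshly crawled pages and applies quality checks before questions enter the benchmark.

```python
import requests
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    answer: str
    source_url: str

def fetch_page(url: str) -> str:
    """Download a news or forum page; returns raw HTML (a real pipeline
    would also extract readable text and capture page images)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def generate_qa(page_text: str, source_url: str, llm_generate) -> QAItem:
    """Ask a strong model to draft one QA pair grounded in the page.
    `llm_generate` is a placeholder for any text-generation call."""
    prompt = (
        "From the article below, write one factual question that can only be "
        "answered using the article, then its answer on a new line starting "
        "with 'ANSWER:'.\n\n" + page_text[:4000]
    )
    out = llm_generate(prompt)
    question, _, answer = out.partition("ANSWER:")
    return QAItem(question.strip(), answer.strip(), source_url)
```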
On LIVEBENCH, the GPT-4 series models, including GPT-4o-mini and GPT-4o, perform strongly, and proprietary models such as the Gemini and Claude series continue to outperform open-source models. The authors conclude that while the evaluation trilemma cannot be fully resolved, their work offers practical ways to navigate its trade-offs, paving the way for more effective and reliable benchmarking of LMMs.