MMBench: Is Your Multi-modal Model an All-around Player?


20 Aug 2024 | Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen
MMBench is a bilingual benchmark for evaluating the multi-modal capabilities of large vision-language models (VLMs). It systematically develops a comprehensive evaluation pipeline with three key features:

1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing benchmarks in the number and variety of evaluation questions and abilities.
2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which yields accurate evaluation results even for models with limited instruction-following capabilities.
3. MMBench provides multiple-choice questions in both English and Chinese, enabling an apples-to-apples comparison of VLM performance in a bilingual context.

MMBench is a systematically designed objective benchmark for robust and holistic evaluation of vision-language models. It contains over 3,000 multiple-choice questions covering 20 ability dimensions, such as object localization and social reasoning. Each ability dimension encompasses over 125 questions, and the number of questions per ability is kept roughly equal; this distribution facilitates a balanced and thorough assessment of these abilities.

To evaluate these abilities robustly, MMBench introduces a novel circular evaluation strategy (CircularEval). GPT-4 is then employed to match a model's prediction to the given choices, which can successfully extract a choice even from the prediction of a VLM with poor instruction-following capability. Using this pipeline, MMBench comprehensively evaluates 21 well-known vision-language models across different architectures and scales and reports their performance on each ability dimension. The resulting ranking offers a direct comparison between models and provides valuable feedback for future optimization.
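The CircularEval idea can be stated concretely: an N-choice question is posed N times, with the options circularly shifted on each pass so the ground-truth answer occupies every position, and the question counts as solved only if the model is correct in all N passes. The Python sketch below is a minimal illustration of this strategy; the `model` callable and its return format are assumptions made for the example, not MMBench's actual interface.

```python
import string


def circular_eval(model, question, choices, answer_index):
    """Minimal CircularEval sketch: ask an N-choice question N times,
    rotating the options each pass, and require a correct answer on
    every pass.

    `model(question, labeled_choices)` is a hypothetical callable that
    returns the predicted option letter, e.g. "B".
    """
    n = len(choices)
    labels = string.ascii_uppercase[:n]                    # "A", "B", "C", "D", ...
    for shift in range(n):
        # Circularly shift the choice list by `shift` positions.
        rotated = choices[shift:] + choices[:shift]
        labeled = [f"{label}. {text}" for label, text in zip(labels, rotated)]

        # After rotation, the ground-truth choice sits at this label.
        correct_label = labels[(answer_index - shift) % n]

        prediction = model(question, labeled)              # e.g. "C"
        if prediction.strip().upper() != correct_label:
            return False                                   # one failed pass fails the question
    return True
```

Under this scheme, random guessing on a 4-choice question passes only about 0.4% of the time ((1/4)^4), versus 25% under single-pass evaluation, which is why CircularEval is much stricter than vanilla 1-pass evaluation.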
MMBench's main contributions are three-fold: a systematically constructed dataset, robust evaluation, and analysis and observations. The dataset performs objective evaluation of VLMs with over 3,000 multiple-choice questions covering 20 ability dimensions. For robust and reliable results, MMBench introduces the CircularEval strategy, which is much stricter than vanilla 1-pass evaluation yet yields reliable results at an affordable cost; because some VLMs have limited instruction-following ability, MMBench additionally adopts LLMs to extract choices from model predictions. Finally, MMBench comprehensively evaluates over 20 mainstream VLMs covering different architectures and parameter sizes, and the evaluation results provide valuable insights for future improvements. As a bilingual multi-modal benchmark, MMBench also enables an apples-to-apples comparison of VLM performance in English and Chinese contexts.
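The LLM-assisted answer extraction can likewise be sketched as a two-stage matcher: cheap rule-based matching first, with an LLM (GPT-4 in the paper) as a fallback for free-form predictions. The helper below is a hedged illustration; the `llm` callable and the prompt wording are assumptions, not MMBench's exact implementation.

```python
def extract_choice(prediction, choices, llm):
    """Map a free-form VLM prediction onto one of the labeled choices.

    Try rule-based matching first; if that fails, fall back to an LLM.
    `llm(prompt)` is a hypothetical callable returning the LLM's text reply.
    """
    labels = [chr(ord("A") + i) for i in range(len(choices))]
    text = prediction.strip()

    # Rule-based pass: the prediction is already a bare option letter,
    # or it literally contains one of the choice strings.
    if text.upper() in labels:
        return text.upper()
    for label, choice in zip(labels, choices):
        if choice.lower() in text.lower():
            return label

    # LLM fallback: ask the LLM to pick the closest option.
    options = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))
    prompt = (
        "You are given a model's answer to a multiple-choice question and the "
        "candidate options. Reply with the single option letter that best "
        f"matches the answer, or 'Z' if none match.\n\nOptions:\n{options}\n\n"
        f"Answer: {text}\nOption letter:"
    )
    reply = llm(prompt).strip().upper()
    return reply if reply in labels else None
```

Separating the cheap rule-based pass from the LLM fallback keeps evaluation affordable: the LLM is only queried for predictions that cannot be matched directly.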