20 Aug 2024 | Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
MMBench is a comprehensive and objective benchmark designed to evaluate the multi-modal capabilities of large vision-language models (VLMs). It addresses the limitations of existing benchmarks: objective ones such as VQA2 and COCO Caption lack fine-grained ability assessment and robust evaluation metrics, while subjective ones like OwlEval are hard to scale and prone to bias. MMBench features a systematic evaluation pipeline with 3,217 multiple-choice questions covering 20 ability dimensions, including object localization, social reasoning, and attribute reasoning. Key features of MMBench include:
1. **Curated Dataset**: MMBench is meticulously curated with a diverse set of questions and abilities, surpassing existing benchmarks in terms of quantity and variety.
2. **CircularEval Strategy**: This strategy feeds a question to a VLM multiple times with its choices rearranged and counts the question as correct only if the VLM succeeds in every attempt, providing a more rigorous and robust evaluation (see the sketch after this list).
3. **Bilingual Support**: MMBench includes both English and Chinese versions, enabling an apples-to-apples comparison of VLM performance under different language contexts.
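To make the CircularEval idea concrete, here is a minimal Python sketch. It assumes the "shuffling" takes the form of circular shifts of the choice list, and the callable `ask_model` is a hypothetical stand-in for querying the VLM; none of the names or signatures below come from the benchmark's actual codebase.

```python
import string


def circular_eval(ask_model, question, choices, correct_idx):
    """Sketch of CircularEval: pose the same question N times, each time with
    the choice list circularly shifted so the correct answer sits under a
    different letter. The question counts as solved only if the model answers
    correctly in every one of the N attempts.

    `ask_model(question, labeled_choices)` is a hypothetical callable standing
    in for the VLM; it should return the letter ('A', 'B', ...) it picked.
    """
    n = len(choices)
    labels = string.ascii_uppercase[:n]
    for shift in range(n):
        # Rotate the options so each one occupies every position exactly once.
        rotated = choices[shift:] + choices[:shift]
        labeled = dict(zip(labels, rotated))

        # After shifting by `shift`, the ground-truth option lands here.
        expected_label = labels[(correct_idx - shift) % n]

        if ask_model(question, labeled) != expected_label:
            return False  # a single failed attempt fails the whole circle
    return True
```

With four choices, for example, each question is asked four times, so accuracy under this circular criterion can never exceed single-pass accuracy; the gap between the two indicates how sensitive a model is to the ordering of the options.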
Evaluation results of 21 well-known VLMs (spanning different architectures and scales) on MMBench provide valuable insights for future research and model optimization. The main contributions of MMBench are:
- **Systematically Constructed Dataset**: A curated dataset with 3,217 questions covering 20 fine-grained skills.
- **Robust Evaluation**: Introduction of the CircularEval strategy to improve evaluation robustness.
- **Analysis and Observations**: Comprehensive evaluation of VLMs on MMBench, offering insights for future improvement.
MMBench aims to facilitate better evaluation of VLMs and promote progress in the field of multi-modal understanding.