20 Aug 2024 | Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin
MMBench is a comprehensive and objective benchmark designed to evaluate the multi-modal capabilities of large vision-language models (VLMs). It addresses the limitations of existing benchmarks: objective ones such as VQA2 and COCO Caption lack fine-grained ability assessment and robust evaluation metrics, while subjective ones like OwlEval are hard to scale and prone to bias. MMBench features a systematic evaluation pipeline with 3,217 multiple-choice questions covering 20 ability dimensions, including object localization, social reasoning, and attribute reasoning. Key features of MMBench include:
1. **Curated Dataset**: MMBench is meticulously curated with a diverse set of questions and abilities, surpassing existing benchmarks in terms of quantity and variety.
2. **CircularEval Strategy**: This strategy feeds a question to a VLM multiple times with its choices rearranged and counts the question as correct only if the VLM succeeds in every attempt, providing a more rigorous and robust evaluation (see the sketch after this list).
3. **Bilingual Support**: MMBench includes both English and Chinese versions, enabling an apples-to-apples comparison of VLM performance under different language contexts.
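To make the CircularEval idea concrete, here is a minimal Python sketch. It assumes the "shuffling" takes the form of circular shifts of the choice list, and the callable `ask_model` is a hypothetical stand-in for querying the VLM; none of the names or signatures below come from the benchmark's actual codebase.

```python
import string


def circular_eval(ask_model, question, choices, correct_idx):
    """Sketch of CircularEval: pose the same question N times, each time with
    the choice list circularly shifted so the correct answer sits under a
    different letter. The question counts as solved only if the model answers
    correctly in every one of the N attempts.

    `ask_model(question, labeled_choices)` is a hypothetical callable standing
    in for the VLM; it should return the letter ('A', 'B', ...) it picked.
    """
    n = len(choices)
    labels = string.ascii_uppercase[:n]
    for shift in range(n):
        # Rotate the options so each one occupies every position exactly once.
        rotated = choices[shift:] + choices[:shift]
        labeled = dict(zip(labels, rotated))

        # After shifting by `shift`, the ground-truth option lands here.
        expected_label = labels[(correct_idx - shift) % n]

        if ask_model(question, labeled) != expected_label:
            return False  # a single failed attempt fails the whole circle
    return True
```

With four choices, for example, each question is asked four times, so accuracy under this circular criterion can never exceed single-pass accuracy; the gap between the two indicates how sensitive a model is to the ordering of the options.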
Evaluation results of 21 well-known VLMs (spanning different architectures and scales) on MMBench provide valuable insights for future research and model optimization. The main contributions of MMBench are:
- **Systematically Constructed Dataset**: A curated dataset with 3,217 questions covering 20 fine-grained skills.
- **Robust Evaluation**: Introduction of the CircularEval strategy to improve evaluation robustness.
- **Analysis and Observations**: Comprehensive evaluation of VLMs on MMBench, offering insights for future improvement.
MMBench aims to facilitate better evaluation of VLMs and promote progress in the field of multi-modal understanding.