VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

3 Mar 2025 | Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
VLMEvalKit is an open-source toolkit, built on PyTorch, for evaluating large multi-modality models (LMMs). It provides a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and to publish reproducible evaluation results. The toolkit supports more than 200 LMMs, including major commercial APIs and open-source models, as well as more than 80 multi-modal benchmarks covering a wide range of tasks and scenarios. It offers a single interface for adding new models and handles tasks such as data preparation, distributed inference, prediction post-processing, and metric calculation, and its design accommodates future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, the OpenVLM Leaderboard is maintained to track the progress of multi-modality learning research. VLMEvalKit is hosted on GitHub under the Apache 2.0 License and is actively maintained.

Integrating a new benchmark or LMM requires little effort, and users can launch evaluations across multiple supported LMMs and benchmarks with a single command, producing well-structured evaluation results.
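As a rough sketch of that single-command workflow, an invocation along the following lines evaluates two models on two benchmarks in one run. The script name run.py, the flags, and the model and benchmark identifiers follow the toolkit's documented usage at the time of writing, but exact names and supported values may differ between releases:

    # evaluate two models on two benchmarks in a single process
    python run.py --data MMBench_DEV_EN MME --model qwen_chat idefics_9b_instruct --verbose

    # the same evaluation with distributed inference across 8 GPUs
    torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME --model qwen_chat idefics_9b_instruct

Predictions and metrics for each model-benchmark pair are then emitted as well-structured evaluation results.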
VLMEvalKit adopts generation-based evaluation for all LMMs and benchmarks, using large language models as choice extractors when exact matching fails; this mitigates the impact of differing response styles and improves evaluation reliability. The toolkit also supports circular evaluation for multiple-choice benchmarks to better assess genuine comprehension. Evaluations of LMMs on general VQA and image reasoning benchmarks show that open-source LMMs now demonstrate strong capabilities in general understanding tasks, often matching or even surpassing the performance of commercial APIs. The toolkit is designed to extend beyond the image modality and has recently incorporated a video understanding benchmark, MMBench-Video. Future development will focus on expanding the repertoire of LMMs and benchmarks for video and other modalities.
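To make the answer-extraction step concrete, the sketch below shows the general pattern for multiple-choice answers: try to match an option letter or option text directly, and fall back to an LLM judge only when that fails. This is a simplified, hypothetical illustration rather than the toolkit's actual implementation; in particular, the judge argument is assumed to be any callable that sends a prompt to an LLM and returns its text reply.

    import re

    def exact_match(prediction: str, choices: dict) -> str | None:
        """Try to read the chosen option letter directly from the model output."""
        text = prediction.strip()
        # accept forms such as "B", "B.", "(B)", "B) because ..."
        m = re.match(r"^\(?([A-Z])[.):]", text) or re.fullmatch(r"\(?([A-Z])\)?", text)
        if m and m.group(1) in choices:
            return m.group(1)
        # otherwise accept an output that repeats exactly one option verbatim
        hits = [k for k, v in choices.items() if v.lower() in text.lower()]
        return hits[0] if len(hits) == 1 else None

    def extract_choice(prediction: str, choices: dict, judge=None) -> str | None:
        """Exact matching first; fall back to an LLM judge when matching fails."""
        letter = exact_match(prediction, choices)
        if letter is not None or judge is None:
            return letter
        options = "\n".join(f"{k}. {v}" for k, v in choices.items())
        prompt = ("Given the options and a model's free-form answer, reply with the "
                  "single option letter that matches the answer, or 'Z' if none matches.\n"
                  f"Options:\n{options}\nAnswer: {prediction}")
        reply = judge(prompt).strip().upper()
        return reply if reply in choices else None

A real implementation would use more robust matching heuristics; the point illustrated here is the control flow, with cheap exact matching first and the LLM judge used only as a fallback.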
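Circular evaluation can be illustrated in the same spirit: the option list is rotated so that the same question is posed once per rotation, with the correct answer sitting behind a different letter each time, and the question counts as correct only if every rotation is answered correctly. The sketch below is a minimal, hypothetical illustration of that idea; answer_fn stands in for a call that runs the model plus choice extraction and returns an option letter.

    from string import ascii_uppercase

    def circular_eval(question: str, options: list, correct_idx: int, answer_fn) -> bool:
        """Mark a multiple-choice question correct only if all rotations pass."""
        n = len(options)
        for shift in range(n):
            rotated = options[shift:] + options[:shift]           # rotate the choice list
            labelled = {ascii_uppercase[i]: opt for i, opt in enumerate(rotated)}
            target = ascii_uppercase[(correct_idx - shift) % n]   # letter of the right answer after rotation
            if answer_fn(question, labelled) != target:
                return False                                      # a single failed pass fails the question
        return True

For a four-option question this amounts to four passes, and the chance of passing by random guessing drops from 1/4 to (1/4)^4, roughly 0.4%, so circular scores reflect comprehension rather than lucky letter picks.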