VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

3 Mar 2025 | Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
VLMEvalKit is an open-source toolkit designed for evaluating large multi-modality models (LMMs) using PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to assess existing LMMs and publish reproducible evaluation results. It supports over 200 different LMMs, including both proprietary APIs and open-source models, and more than 80 multi-modal benchmarks covering a wide range of tasks and scenarios. The toolkit simplifies the integration of new benchmarks or LMMs through a single interface and handles data preparation, distributed inference, prediction post-processing, and metric calculation automatically. It employs generation-based evaluation to ensure fair comparisons, especially for multi-choice questions, by using large language models (LLMs) for answer extraction. The toolkit also includes a leaderboard to track the progress of LMM development. VLMEvalKit is publicly available on GitHub under the Apache 2.0 License and is actively maintained. The toolkit's design is compatible with future updates that incorporate additional modalities, such as audio and video.
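To illustrate the generation-based evaluation of multi-choice questions described above, the following is a minimal sketch, not VLMEvalKit's actual code: a rule-based first pass that tries to match an option letter or option text in a model's free-form generation. The function name `extract_choice` and its signature are illustrative assumptions; in the toolkit, predictions that such heuristics cannot resolve are handed to an LLM-based answer extractor.

```python
from __future__ import annotations
import re

# Illustrative sketch only (not VLMEvalKit's implementation): a heuristic
# first pass for recovering a multi-choice answer from free-form model output.
# A real pipeline falls back to an LLM judge when these rules fail.

def extract_choice(prediction: str, choices: dict[str, str]) -> str | None:
    """Return the matched option letter, or None if the answer is ambiguous."""
    pred = prediction.strip()

    # Case 1: the model answered with a bare option letter, e.g. "B" or "(B).".
    m = re.fullmatch(r"\(?([A-D])\)?\.?", pred)
    if m:
        return m.group(1)

    # Case 2: the prediction quotes exactly one option's text verbatim.
    hits = [letter for letter, text in choices.items()
            if text.lower() in pred.lower()]
    if len(hits) == 1:
        return hits[0]

    # Ambiguous or unmatched: defer to an LLM-based extractor in practice.
    return None


if __name__ == "__main__":
    options = {"A": "a cat", "B": "a dog", "C": "a horse", "D": "a bird"}
    print(extract_choice("The animal in the image is a dog.", options))  # B
    print(extract_choice("(C)", options))                                # C
```

Heuristics like this keep evaluation cheap and deterministic for the common case, while the LLM fallback handles verbose or indirect answers so that generation-based scoring remains fair across models with different output styles.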