2 Jul 2024 | Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Yan, Wenjie Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
MuirBench is a comprehensive benchmark designed to evaluate the multi-image understanding capabilities of multimodal large language models (LLMs). It consists of 11,264 images and 2,600 multiple-choice questions spanning 12 diverse multi-image tasks and covering 10 categories of multi-image relations. The benchmark is constructed in a pairwise manner: each answerable instance is paired with an unanswerable variant, enabling a more reliable assessment. Evaluations of 20 recent multimodal LLMs reveal that even the best-performing models struggle with multi-image understanding, with GPT-4o reaching only 68.0% accuracy and Gemini Pro only 49.3%. Open-source multimodal LLMs trained on single images fare even worse, with accuracy below 33.3%. These results highlight the need for more robust multi-image understanding and point to potential pathways for future improvement. MuirBench provides a rigorous framework for assessing and enhancing multi-image reasoning capabilities in multimodal LLMs.