2 Jul 2024 | Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Yan, Wenjie Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
MUIRBENCH is a comprehensive benchmark designed to evaluate the robust multi-image understanding capabilities of multimodal large language models (multimodal LLMs). The benchmark includes 11,264 images and 2,600 multiple-choice questions across 12 diverse multi-image tasks, such as scene understanding, ordering, and visual retrieval. The tasks collectively cover 10 categories of multi-image relations, including temporal, narrative, and complementary relations. The benchmark is constructed in a pairwise manner: each answerable instance is paired with an unanswerable variant that differs only minimally in semantics, enabling a more reliable assessment of models.
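To make the pairwise answerable/unanswerable design concrete, here is a minimal sketch of how one such instance pair might be represented. The class, field names, and the "none of the above"-style pairing strategy are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MuirInstance:
    """Illustrative (unofficial) schema for one multiple-choice instance."""
    pair_id: str        # shared by an answerable/unanswerable pair
    task: str           # e.g. "ordering", "visual_retrieval"
    relation: str       # e.g. "temporal", "narrative", "complementary"
    question: str
    images: List[str]   # paths or URLs of the multi-image input
    options: List[str]  # multiple-choice candidates
    answer: str         # the correct option string
    answerable: bool = True


def make_unanswerable(inst: MuirInstance, new_images: List[str]) -> MuirInstance:
    """One hypothetical pairing strategy: swap in images so no original option
    remains correct, and make a 'none of the above'-style option the answer."""
    none_option = "None of the above"
    return MuirInstance(
        pair_id=inst.pair_id,
        task=inst.task,
        relation=inst.relation,
        question=inst.question,
        images=new_images,
        options=inst.options + [none_option],
        answer=none_option,
        answerable=False,
    )
```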
Twenty recent multimodal LLMs, including GPT-4o and Gemini Pro, are evaluated on MUIRBENCH. Even the best-performing models reach only 68.0% (GPT-4o) and 49.3% (Gemini Pro) accuracy, well below human performance. Open-source models trained on single images generalize poorly, hovering below 33.3% accuracy. These results underscore the value of MUIRBENCH in encouraging the development of multimodal LLMs that can handle multi-image scenarios effectively.
Beyond its range of multi-image relations, the benchmark strengthens evaluation robustness by incorporating unanswerable instances, and it covers diverse image types, such as slides, maps, medical images, and drone/satellite imagery. The data is curated from multiple sources, combining existing datasets, data derived from them, and newly collected data to ensure broad coverage of tasks and relations.
This design enables a comprehensive evaluation of multi-image understanding, spanning tasks such as image-text matching, visual retrieval, and diagram understanding. The unanswerable questions additionally test whether models can recognize when a question cannot be answered from the given images and options, rather than guessing. The results show that models trained on single images struggle with multi-image tasks, underscoring the need for models that can synthesize and reason across multiple visual sources.
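As a hedged illustration of how such an evaluation could be aggregated (not the paper's released scoring code), the sketch below computes overall, answerable-only, unanswerable-only, and per-task accuracy from model predictions, reusing the hypothetical MuirInstance schema from the earlier sketch; the function name and result keys are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_accuracy(results: Iterable[Tuple["MuirInstance", str]]) -> Dict[str, float]:
    """Compute overall, answerable-only, unanswerable-only, and per-task accuracy
    from (instance, predicted_option) pairs. Purely illustrative."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for inst, pred in results:
        correct = int(pred == inst.answer)
        keys = (
            "overall",
            "answerable" if inst.answerable else "unanswerable",
            f"task:{inst.task}",
        )
        for key in keys:
            hits[key] += correct
            totals[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

Splitting the scores this way makes it easy to see whether a model's headline accuracy is inflated by guessing on unanswerable variants.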
MUIRBENCH provides a rigorous framework for assessing multimodal LLMs, encouraging the development of models that can effectively handle multi-image scenarios. The benchmark's comprehensive design and diverse tasks make it a valuable resource for evaluating and improving the capabilities of multimodal LLMs in multi-image understanding.