MLLM-COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs

13 Jan 2025 | Jihyung Kil*, Zheda Mai*, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Wei-Lun Chao
The paper introduces MLLM-COMPBENCH, a benchmark designed to evaluate the comparative reasoning capabilities of multimodal large language models (MLLMs). The benchmark covers eight dimensions of relative comparison: visual attributes, existence, state, emotion, temporality, spatiality, quantity, and quality. It curates roughly 40K image pairs from diverse vision datasets, using CLIP similarity scores to select visually comparable pairs across a wide range of visual domains. Each question is crafted to discern a relative characteristic between the two images and is labeled by human annotators. The benchmark is used to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6, revealing notable shortcomings in their comparative abilities. The paper also discusses the advantages of MLLM-COMPBENCH, details its data-curation process, and presents experimental results that highlight where current models struggle and suggest directions for future improvement.
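To make the CLIP-based pairing step concrete, below is a minimal sketch of how image pairs might be selected by CLIP similarity. It is not the authors' actual pipeline: the checkpoint ("openai/clip-vit-base-patch32") and the similarity band (0.80 to 0.95) are illustrative assumptions, chosen only to convey the idea that good comparison pairs are similar enough to be comparable yet distinct enough to differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper does not specify which CLIP variant it uses.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def embed_images(paths):
    """Embed a list of image files and unit-normalize the features,
    so dot products below equal cosine similarities."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def candidate_pairs(paths, low=0.80, high=0.95):
    """Return index pairs whose CLIP cosine similarity falls inside a
    band; the thresholds are hypothetical, not from the paper."""
    feats = embed_images(paths)
    sims = feats @ feats.T  # pairwise cosine-similarity matrix
    pairs = []
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if low <= sims[i, j] <= high:
                pairs.append((i, j, sims[i, j].item()))
    return pairs
```

In practice, candidate pairs produced this way would still need the paper's human-annotation step to attach a comparative question and a verified relative label.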