MLLM-COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs


13 Jan 2025 | Jihyung Kil, Zheda Mai*, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Wei-Lun Chao
MLLM-COMPBENCH is a benchmark designed to evaluate the comparative reasoning capabilities of multimodal large language models (MLLMs). Comparative reasoning underpins everyday decision-making and problem-solving, yet it has been largely untested in MLLMs. The benchmark assesses how well models compare objects, scenes, or situations across eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. It comprises 39.8K triplets, each containing two visually or semantically relevant images, a question about their relativity, and a ground-truth answer.

The dataset is curated from diverse vision datasets, with image pairs selected using CLIP similarity scores, and spans visual domains including animals, fashion, sports, and both outdoor and indoor scenes. Questions are carefully crafted to discern relative characteristics between the two images, and human annotators label them for accuracy and relevance. The benchmark is extensible: additional data sources can be incorporated over time.
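To make the curation step concrete, below is a minimal sketch of CLIP-based pair selection and the kind of triplet record it produces, assuming the Hugging Face transformers CLIP implementation. The similarity threshold and the field names are illustrative, not the benchmark's actual pipeline or schema.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(path: str) -> torch.Tensor:
    """Encode one image into CLIP's joint embedding space, unit-normalized."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

def clip_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between two images; higher means more related."""
    return float(image_embedding(path_a) @ image_embedding(path_b).T)

# Keep only pairs related enough for a relative question to be well-posed.
# The 0.8 cutoff and the record fields below are illustrative assumptions.
if clip_similarity("dog_a.jpg", "dog_b.jpg") > 0.8:
    triplet = {
        "image_pair": ("dog_a.jpg", "dog_b.jpg"),
        "question": "Which dog appears larger?",  # relativity question
        "answer": "Left",                         # ground-truth label
    }
```

Pairing by embedding similarity ensures the two images are comparable along some dimension (e.g., two dogs of different sizes), so a question about their relative characteristics makes sense.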
Evaluating recent MLLMs, including GPT-4V, Gemini-Pro, and LLaVA-1.6, reveals notable shortcomings in their comparative abilities: across the existence, state, emotion, temporality, spatiality, quantity, and quality tasks, current models struggle to answer relative questions. A two-stage variant, in which a model first analyzes each image separately and then answers a comparative follow-up question, can actually reduce performance, because errors in the per-image (absolute) inference propagate into the final comparison; a sketch of this protocol appears after this summary.

Fine-tuning experiments, in which LLaVA-1.6 is fine-tuned on specific source datasets, show improvements on some tasks but limited gains on others. Error analysis highlights recurring failure modes: differentiating colors, counting small or distant objects, identifying objects in crowded scenes, and recognizing out-of-focus details. A human evaluation shows that current MLLMs lag behind human performance on several relativities, and even recent MLLMs released after the NeurIPS deadline improve on some tasks while remaining mediocre on others.

Overall, MLLM-COMPBENCH provides a comprehensive testbed for assessing comparative reasoning in MLLMs, exposes a clear gap to human performance, and points to comparative reasoning as a capability in need of further research.
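As an illustration of the two-stage protocol discussed above, here is a minimal sketch using the OpenAI Python client as a stand-in for a vision-capable MLLM; the model name, prompts, and answer format are assumptions, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, image_urls: list[str]) -> str:
    """Send a text prompt plus zero or more images to a vision-capable chat model."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper evaluates GPT-4V, Gemini-Pro, LLaVA-1.6
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

url_a = "https://example.com/dog_a.jpg"  # placeholder image URLs
url_b = "https://example.com/dog_b.jpg"

# Stage 1: absolute, per-image analysis, done separately for each image.
desc_a = ask("Describe the size of the dog in this image.", [url_a])
desc_b = ask("Describe the size of the dog in this image.", [url_b])

# Stage 2: comparative follow-up grounded only in the two per-image analyses.
verdict = ask(
    f"Image 1 analysis: {desc_a}\nImage 2 analysis: {desc_b}\n"
    "Based on these analyses, which dog is larger? Answer 'Left' or 'Right'.",
    [],
)
```

The failure mode the paper identifies is visible in this structure: if either stage-1 description mischaracterizes its image, the stage-2 comparison inherits the error.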
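Finally, scoring a model on the benchmark reduces to an accuracy loop over triplets. The sketch below assumes the illustrative triplet schema and an ask-style callable from the snippets above; the 'Left'/'Right' answer format is hypothetical.

```python
from typing import Callable

def evaluate(model_ask: Callable[[str, list[str]], str], triplets: list[dict]) -> float:
    """Fraction of relativity questions answered correctly (match on the label)."""
    correct = 0
    for t in triplets:
        prediction = model_ask(
            t["question"] + " Answer 'Left' or 'Right'.",
            list(t["image_pair"]),
        )
        correct += prediction.strip().lower().startswith(t["answer"].lower())
    return correct / len(triplets)
```

Because each triplet carries a single ground-truth label, exact-match accuracy per relativity dimension is the natural headline metric.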