NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models


5 Mar 2024 | Lizhou Fan, Wenyue Hua, Xiang Li, Kaijie Zhu, Mingyu Jin, Lingyao Li, Haoyang Ling, Jinkui Chi, Jindong Wang, Xin Ma, Yongfeng Zhang
NPHardEval4V is a dynamic benchmark designed to evaluate the reasoning abilities of Multimodal Large Language Models (MLLMs). It addresses the gap in assessing pure reasoning capabilities by converting textual descriptions of problems from NPHardEval into visual representations. The benchmark includes tasks categorized into polynomial-time (P), NP-complete, and NP-hard problems, with varying difficulty levels. It aims to disentangle the effects of factors such as image recognition and instruction following from overall model performance, focusing solely on reasoning abilities. Unlike static benchmarks, NPHardEval4V is updated monthly to prevent overfitting and to ensure a more accurate evaluation of models, providing a rigorous framework for assessing MLLMs' reasoning abilities grounded in the computational complexity hierarchy. The dataset and code are available at https://github.com/lizhouf/NPHardEval4V.

The benchmark evaluates MLLMs on tasks such as graph coloring, shortest path, and knapsack problems, using both visual and textual inputs. It investigates the impact of different prompt types, including visual, text, and combined visual-text prompts, on MLLMs' reasoning performance. The results show that MLLMs generally perform worse than Large Language Models (LLMs) on reasoning tasks, with significant differences in performance across complexity levels.

The benchmark also highlights the importance of visual inputs in enhancing reasoning abilities, with some models, such as Gemini, performing better with text-only or vision-rich-text prompts. The evaluation metrics include Recognition Accuracy (RA), Instruction-following Effective Rate (ER), and Aggregated Accuracy (AA), which assess the models' ability to recognize inputs, follow instructions, and provide accurate answers. The findings indicate that MLLMs lag behind LLMs in reasoning tasks, emphasizing the need for further research to improve their reasoning capabilities. The benchmark's dynamic nature ensures that assessments remain relevant and challenging, fostering models that can learn and adapt rather than merely optimizing for static benchmarks.
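To illustrate the three prompting modes the benchmark compares, the sketch below builds a vision-only, a text-only, or a combined vision-plus-text prompt for a single task instance. The function and field names (build_prompt, images, text, and the mode strings) are hypothetical and only convey the idea; the benchmark's actual prompt construction is in the linked repository.

```python
# Illustrative sketch only: hypothetical helper showing the three prompting
# modes compared in NPHardEval4V (vision-only, text-only, vision + text).
# Names and structure are assumptions, not the benchmark's actual code.
from typing import TypedDict


class Prompt(TypedDict):
    images: list[str]  # paths to rendered problem images (e.g., a graph to color)
    text: str          # instructions, optionally followed by the textual problem


def build_prompt(task_description: str, instructions: str,
                 image_path: str, mode: str) -> Prompt:
    if mode == "vision_only":
        # The problem instance is conveyed only through the image.
        return {"images": [image_path], "text": instructions}
    if mode == "text_only":
        # The problem instance is conveyed only through text.
        return {"images": [], "text": f"{instructions}\n\n{task_description}"}
    if mode == "vision_text":
        # Both modalities describe the same problem instance.
        return {"images": [image_path], "text": f"{instructions}\n\n{task_description}"}
    raise ValueError(f"unknown mode: {mode}")
```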
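The three reported metrics can be thought of as per-instance checks averaged over the benchmark. The sketch below is a simplified, hypothetical scoring routine assuming binary per-instance judgments; the paper's exact definitions and any aggregation weights may differ.

```python
# Simplified, hypothetical scoring sketch for Recognition Accuracy (RA),
# Instruction-following Effective Rate (ER), and Aggregated Accuracy (AA).
# The paper's exact formulas and aggregation may differ from this toy version.
from dataclasses import dataclass


@dataclass
class InstanceResult:
    recognized_input: bool       # model correctly described the visual/text input
    followed_instructions: bool  # output matched the required answer format
    answer_correct: bool         # final answer actually solves the task instance


def score(results: list[InstanceResult]) -> dict[str, float]:
    n = len(results)
    ra = sum(r.recognized_input for r in results) / n
    er = sum(r.followed_instructions for r in results) / n
    # Hypothetical aggregation: only well-formed and correct answers count.
    aa = sum(r.followed_instructions and r.answer_correct for r in results) / n
    return {"RA": ra, "ER": er, "AA": aa}


if __name__ == "__main__":
    demo = [
        InstanceResult(True, True, True),
        InstanceResult(True, False, False),
        InstanceResult(False, True, False),
    ]
    print(score(demo))  # prints RA, ER, and AA for this toy sample
```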