The paper introduces NPHardEval4V, a dynamic benchmark designed to evaluate the pure reasoning capabilities of Multimodal Large Language Models (MLLMs). The benchmark aims to disentangle the effects of image recognition and instruction following from overall model performance so that reasoning ability can be assessed in isolation. It is built by converting textual problem descriptions from NPHardEval into visual representations, covering polynomial-time (P), NP-complete, and NP-hard problems. The study reveals significant gaps in reasoning ability across models and shows that MLLMs remain relatively weak compared to Large Language Models (LLMs) on reasoning tasks. It also investigates how different prompting styles, including visual, textual, and combined visual-text prompts, affect MLLMs' reasoning. Unlike static benchmarks, NPHardEval4V is updated monthly to prevent overfitting and ensure a more authentic evaluation. The benchmark dataset and code are available at <https://github.com/lizhouf/NPHardEval4V>.
The evolution of MLLMs marks a significant milestone in the pursuit of artificial general intelligence (AGI), enhancing multimedia interaction systems and cross-modal decision-making tools. Reasoning is a critical ability for MLLMs, enabling them to understand complex relationships across modalities and make informed decisions. However, existing benchmarks often conflate recognition and instruction following with reasoning, making it difficult to assess reasoning ability on its own. NPHardEval4V addresses this by providing a dynamic framework that is updated regularly to prevent overfitting and to support a more comprehensive evaluation.
The benchmark is built upon the NPHardEval framework, which segments tasks into three computational complexity classes: P, NP-complete, and NP-hard. Each class includes problems with varying difficulty levels, allowing for a nuanced assessment of model performance. The benchmark transforms textual descriptions into visual representations, providing both textual and visual information to evaluate MLLMs' reasoning abilities. The experimental setup includes recognition and reasoning experiments, with metrics such as Recognition Accuracy (RA), Instruction-following Effective Rate (ER), and Aggregated Accuracy (AA) used to assess performance.
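To make this setup more concrete, the sketch below (a minimal illustration, not the authors' released pipeline) shows how a textual graph-problem instance could be rendered into an image and how per-difficulty accuracies could be aggregated into a single score. The function names, output path, and difficulty-proportional weighting are assumptions for illustration only.

```python
# A minimal sketch (not the paper's exact pipeline) of how a textual
# NPHardEval-style problem instance might be turned into a visual prompt,
# and how per-difficulty results might be aggregated. The weighting scheme
# and helper names here are illustrative assumptions.
import networkx as nx
import matplotlib.pyplot as plt

def render_graph_instance(edges, out_path="instance.png"):
    """Draw an edge-list description of a graph problem as an image."""
    G = nx.Graph()
    G.add_edges_from(edges)
    pos = nx.spring_layout(G, seed=0)  # deterministic layout for reproducibility
    nx.draw(G, pos, with_labels=True, node_color="lightblue")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

def weighted_accuracy(per_level_acc):
    """Aggregate accuracy over difficulty levels, weighting harder levels more.

    per_level_acc: dict mapping difficulty level (1 = easiest) to accuracy.
    """
    total_weight = sum(per_level_acc)  # weights proportional to level (assumed)
    return sum(level * acc for level, acc in per_level_acc.items()) / total_weight

# Example: a small graph instance and mock per-difficulty accuracies.
render_graph_instance([(1, 2), (2, 3), (3, 1), (3, 4)])
print(weighted_accuracy({1: 0.9, 2: 0.6, 3: 0.3}))  # 0.5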
The results show that closed-source models such as Gemini outperform open-source models on all tasks, regardless of complexity. MLLMs' reasoning performance declines as task complexity increases, with a clear downward trend from the simpler P problems to NP-hard problems. Gemini stands out with superior performance in the text-only and vision-plus-rich-text setups, indicating an advanced ability to process and integrate textual information. The study also highlights the importance of prompt design and the need for further research to strengthen MLLMs' textual understanding and multimodal integration.
The paper concludes by emphasizing the need for dynamic and stringent testing to deepen our understanding of MLLMs' capabilities and constraints. It calls for further research in areas such as longitudinal learning studies, expanding reasoning taxonomies, and harmonizing model evolution with benchmarks. The study underscores the importance of addressing the limitations of current models and the need for more advanced reasoning capabilities in MLLMs.