24 Apr 2024 | Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara
MARVEL is a multidimensional abstract visual reasoning (AVR) benchmark designed to evaluate multimodal large language models (MLLMs) across diverse patterns, input shapes, and task configurations. It comprises 770 puzzles spanning six core knowledge patterns, geometric and abstract shapes, and five different task configurations. MARVEL also introduces a hierarchical evaluation framework that pairs perception questions with the AVR questions, testing whether models grasp the visual details and spatial relationships that abstract reasoning depends on.
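As a rough illustration of how such a benchmark item might be organized, the sketch below represents one puzzle as a record carrying its knowledge pattern, shape type, and task configuration alongside its perception and AVR questions. The field names are hypothetical and do not mirror the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical representation of one MARVEL puzzle; field names are
# illustrative only, not the dataset's actual schema.
@dataclass
class MarvelPuzzle:
    puzzle_id: str
    pattern: str                # one of the six core knowledge patterns
    shape_type: str             # "geometric" or "abstract"
    task_config: str            # one of the five task configurations
    image_path: str             # rendered puzzle panels
    choices: List[str]          # answer options for the AVR question
    avr_answer: int             # index of the correct option
    perception_questions: List[dict] = field(default_factory=list)
```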
The benchmark was developed to address the limitations of existing AVR benchmarks, which tend to cover only a narrow set of patterns and input shapes; by spanning a wider range of patterns and task configurations, MARVEL supports a more accurate assessment of models' reasoning abilities. Nine representative MLLMs were evaluated in zero-shot and few-shot settings. All of them performed near random chance on the AVR questions, with a significant performance gap (40%) compared to humans, and analysis of the perception questions showed that the models struggled to comprehend visual features and even to count the panels in a puzzle, which undermines the abstract reasoning built on top of that perception.
The hierarchical evaluation framework enables fine-grained diagnosis of model capabilities: its perception questions probe whether a model registers details such as the number of panels, the edges of shapes, and spatial relationships between elements. The experiments show that MLLMs often fail on exactly these details, and those failures track their poor performance on the AVR questions, underscoring how much abstract reasoning depends on reliable visual perception.
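One way to read this hierarchical setup is as paired scoring: a model's AVR answer is most credible when it also answers the perception questions about the same puzzle correctly. The sketch below, which assumes hypothetical per-puzzle result fields rather than the paper's exact metric, computes perception accuracy, AVR accuracy, and a combined score under that assumption.

```python
def hierarchical_scores(results):
    """Aggregate per-puzzle results into perception, AVR, and combined accuracy.

    `results` is a list of dicts with hypothetical keys:
      "perception_correct": list of bools, one per perception question
      "avr_correct": bool, whether the AVR answer was correct
    """
    n = len(results)
    perception_acc = sum(all(r["perception_correct"]) for r in results) / n
    avr_acc = sum(r["avr_correct"] for r in results) / n
    # Credit abstract reasoning only when the underlying perception also holds.
    consistent = sum(
        r["avr_correct"] and all(r["perception_correct"]) for r in results
    ) / n
    return {"perception": perception_acc, "avr": avr_acc, "consistent": consistent}
```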
Overall, the results indicate that current MLLMs have significant limitations in abstract visual reasoning, rooted largely in weak perception of visual details and spatial relationships. MARVEL provides a comprehensive way to measure this gap, and the findings suggest that future work should focus on strengthening MLLMs' visual perception as a prerequisite for better abstract reasoning.