6 Jul 2024 | Yijia Xiao, Edward Sun, Tianyu Liu, Wei Wang
LogicVista is a benchmark designed to evaluate the logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. It comprises 448 multiple-choice questions spanning five logical reasoning tasks: inductive, deductive, numerical, spatial, and mechanical reasoning. Each question is annotated with the correct answer and the underlying reasoning, enabling both open-ended and multiple-choice evaluation, and the questions exercise a broad range of capabilities, including diagrams, OCR, patterns, graphs, tables, 3D shapes, puzzles, sequences, and physics. All images, instructions, solutions, and reasoning annotations are manually written and validated. LogicVista uses an LLM-based evaluation approach to extract and score model answers, allowing for quantitative analysis of performance. Eight MLLMs are evaluated across the five logical reasoning categories, providing a comprehensive assessment of their abilities. The results show that models perform best on deductive, numerical, and mechanical reasoning, and worst on inductive and spatial reasoning, which are less frequently encountered in training data. The benchmark highlights the importance of logical reasoning in complex tasks, provides a platform for assessing the reasoning capabilities of MLLMs, and underscores the need for training and evaluation methodologies that prioritize reasoning.
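To make the LLM-based evaluation concrete, here is a minimal sketch of one plausible scoring pipeline: an LLM "judge" is prompted to map a model's free-form response onto one of the annotated option letters, and per-skill accuracy is then computed against the ground truth. The `Sample` fields, `call_judge_llm` stub, and prompt wording are hypothetical placeholders, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str   # question text paired with an image
    choices: dict   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str     # annotated correct option letter
    skill: str      # inductive / deductive / numerical / spatial / mechanical

def call_judge_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. a chat-completion request).
    Assumed to return a single option letter such as 'A'."""
    raise NotImplementedError

def extract_choice(sample: Sample, model_response: str) -> str:
    """Ask the judge LLM to map a free-form response to one option letter."""
    options = "\n".join(f"{k}. {v}" for k, v in sample.choices.items())
    prompt = (
        "Given the question, the answer options, and a model's response, "
        "reply with only the letter of the option the response selects.\n\n"
        f"Question: {sample.question}\nOptions:\n{options}\n"
        f"Response: {model_response}\nLetter:"
    )
    return call_judge_llm(prompt).strip().upper()[:1]

def accuracy_by_skill(samples, responses):
    """Compute per-skill accuracy over paired (sample, model_response) lists."""
    correct, total = {}, {}
    for sample, response in zip(samples, responses):
        total[sample.skill] = total.get(sample.skill, 0) + 1
        if extract_choice(sample, response) == sample.answer:
            correct[sample.skill] = correct.get(sample.skill, 0) + 1
    return {skill: correct.get(skill, 0) / total[skill] for skill in total}
```

Using an LLM to extract the chosen option, rather than brittle string matching, lets the benchmark score free-form answers while still reporting standard multiple-choice accuracy per reasoning category.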