LogicVista is a comprehensive evaluation benchmark designed to assess the logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. The benchmark covers five categories of logical reasoning tasks—inductive, deductive, numerical, spatial, and mechanical—and includes 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluations. The benchmark evaluates the performance of eight MLLMs, including LLaVA, MiniGPT4, Otter, GPT-4 Vision, BLIP-2, and InstructBLIP. The evaluation setup uses an LLM-based multiple-choice answer extractor to compare MLLM outputs with the ground-truth answers. The results indicate that while models perform better on deductive, numerical, and mechanical reasoning tasks, they often struggle with inductive and spatial reasoning tasks. The benchmark highlights the need for enhanced training and evaluation methodologies that prioritize reasoning tasks in order to improve the logical reasoning capabilities of MLLMs.
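
Because MLLMs answer in free-form text rather than a single option letter, the evaluation relies on an LLM to pull the chosen option out of each response before comparing it with the ground truth. The exact prompt and code used by LogicVista are not given here; the sketch below is a minimal illustration of how such an LLM-based answer extractor could be wired up, assuming the judge LLM is exposed through a user-supplied `query_llm` callable. The prompt wording, record format, and function names are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import Callable, Optional

# Hypothetical extraction prompt; the wording used by LogicVista is an assumption.
EXTRACTION_PROMPT = (
    "An answer to a multiple-choice question is given below. "
    "Reply with only the letter of the chosen option (A, B, C, D, or E). "
    "If no option is chosen, reply with 'None'.\n\nAnswer:\n{output}"
)


def extract_choice(mllm_output: str, query_llm: Callable[[str], str]) -> Optional[str]:
    """Use a judge LLM to map a free-form MLLM response to a single option letter."""
    reply = query_llm(EXTRACTION_PROMPT.format(output=mllm_output))
    match = re.search(r"\b([A-E])\b", reply.strip().upper())
    return match.group(1) if match else None


def accuracy(records: list[dict], query_llm: Callable[[str], str]) -> float:
    """Compare extracted choices against ground-truth letters and return accuracy.

    Each record is assumed to look like {"prediction": <MLLM output>, "answer": "A".."E"}.
    """
    correct = sum(
        extract_choice(rec["prediction"], query_llm) == rec["answer"] for rec in records
    )
    return correct / len(records) if records else 0.0
```

Abstracting the judge model behind a callable keeps the sketch independent of any particular API; in practice the same pattern works whether the extractor is a hosted chat model or a locally served one.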