17 Aug 2024 | Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, Soujanya Poria
The paper introduces PUZZLEVQA, a dataset of 2000 puzzle instances based on abstract patterns to evaluate the reasoning abilities of large multimodal models. The puzzles focus on fundamental concepts such as colors, numbers, shapes, and sizes, and are designed to test visual perception, inductive reasoning, and deductive reasoning. Experiments with state-of-the-art models, including GPT-4V, reveal that they struggle with these abstract patterns, achieving only 46.4% accuracy on single-concept puzzles. The main bottlenecks identified are weaker visual perception and inductive reasoning. The paper also provides ground truth reasoning explanations to guide the models and diagnose their reasoning challenges. The results highlight the limitations of current models in emulating human cognitive processes and suggest areas for future research to enhance multimodal reasoning capabilities.
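To make the evaluation setup concrete, below is a minimal Python sketch of how multiple-choice accuracy on such puzzles might be scored, overall and per concept. The record fields, the `query_model` stub, and the file name are assumptions for illustration only, not the authors' released code.

```python
# Minimal sketch of a multiple-choice evaluation over abstract puzzle instances.
# Assumptions (not from the paper): puzzles are stored locally as JSON records with
# "image", "question", "options", "answer", and "concept" fields, and `query_model`
# is a hypothetical stand-in for any large multimodal model API.

import json
from collections import defaultdict

def query_model(image_path: str, question: str, options: list[str]) -> str:
    """Hypothetical model call: return the option the model selects.
    Replace with a real multimodal API (e.g. GPT-4V) in practice."""
    prompt = question + "\nOptions: " + ", ".join(options)
    # ... send the image and prompt to the model, parse the chosen option ...
    return options[0]  # placeholder so the sketch runs end to end

def evaluate(instances: list[dict]) -> dict[str, float]:
    """Compute overall and per-concept accuracy over puzzle instances."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        prediction = query_model(ex["image"], ex["question"], ex["options"])
        for key in ("overall", ex["concept"]):  # e.g. colors, numbers, shapes, sizes
            total[key] += 1
            correct[key] += int(prediction == ex["answer"])
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    with open("puzzlevqa_single_concept.json") as f:  # assumed local file name
        puzzles = json.load(f)
    print(evaluate(puzzles))
```

The same loop extends naturally to the paper's diagnostic setting: by appending the ground truth visual perception or inductive reasoning explanations to the prompt, one can measure how much each hint recovers in accuracy and thereby locate the bottleneck stages.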