2024-06-14 | Mehran Kazemi¹, Nishanth Dikkala², Ankit Anand¹, Petar Dević³, Ishita Dasgupta¹, Fangyu Liu¹, Bahare Fatemi², Pranjal Awasthi², Dee Guo², Sreenivas Gollapudi² and Ahmed Qureshi³
ReMI is a new benchmark for multi-image reasoning, designed to evaluate large language models (LLMs) on tasks that require reasoning across multiple images. The dataset comprises 13 tasks spanning domains such as math, physics, logic, coding, and spatial reasoning, and covers a range of properties specific to the multi-image setting: reasoning over images depicting different concepts, images interleaved with text, and varying numbers of images, often requiring information from text and images to be integrated. Benchmarking several state-of-the-art LLMs on ReMI reveals a significant gap between model performance and human proficiency: models perform substantially worse than humans, especially on tasks requiring complex reasoning. The per-task analysis also shows that different models excel on different tasks, indicating that current state-of-the-art models have distinct capabilities and limitations, and pointing to areas where future models need improvement. Experiments on input format show that models perform better when the images are provided as separate inputs interleaved with the text rather than combined into a single image. A failure analysis reveals recurring error modes, including calculation errors, misread values, and incorrect reasoning. Overall, there remains significant room for improvement in the multi-image reasoning capabilities of current LLMs, and further research is needed to address the identified limitations. ReMI is open-sourced to encourage further research in multi-image reasoning.
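To make the input-format comparison concrete, the sketch below (not from the paper) contrasts the two conditions the abstract describes: each image passed as a separate input interleaved with the text versus all images stitched into a single combined image. The prompt-part structure and the `query_model` call are hypothetical stand-ins for whatever multimodal API is being evaluated; only the Pillow image handling is standard.

```python
# Minimal sketch of the two input formats compared in the experiments.
# `query_model` and the dict-based prompt parts are assumptions, not the
# paper's actual evaluation harness.
from PIL import Image

def interleaved_prompt(text_segments, image_paths):
    """Keep every image as its own input, interleaved with the text segments."""
    parts = []
    for text, path in zip(text_segments, image_paths):
        parts.append({"type": "text", "text": text})
        parts.append({"type": "image", "image": Image.open(path)})
    return parts

def single_image_prompt(question_text, image_paths):
    """Stitch all images side by side into one canvas and attach it once."""
    images = [Image.open(p) for p in image_paths]
    width = sum(img.width for img in images)
    height = max(img.height for img in images)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for img in images:
        canvas.paste(img, (x, 0))
        x += img.width
    return [{"type": "text", "text": question_text},
            {"type": "image", "image": canvas}]

# Hypothetical usage: score a model under both conditions and compare accuracy.
# for ex in remi_examples:                        # assumed iterable of tasks
#     ans_sep = query_model(interleaved_prompt(ex.texts, ex.images))
#     ans_one = query_model(single_image_prompt(ex.question, ex.images))
```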