2024-6-14 | Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Dee Guo, Sreenivas Gollapudi, Ahmed Qureshi
The paper introduces ReMI, a dataset for assessing large language models (LLMs) on multi-image reasoning tasks, in which information from several images must be integrated to solve problems in domains such as math, physics, logic, and spatial reasoning. The dataset comprises 13 tasks covering a wide range of domains and properties, including sequential vs. set-based consumption of image information, same vs. different concepts across images, and interleaving of images with text. Evaluating state-of-the-art LLMs on ReMI, the authors find a significant gap between model performance and human proficiency, highlighting the need for further research. Their analysis reveals the strengths and weaknesses of individual models, giving insight into which kinds of multi-image reasoning are currently achievable and where improvement is needed. The paper also compares providing the images separately against providing them all together, noting that feeding images separately can sometimes improve performance. A detailed failure analysis identifies common error sources, including calculation errors, image-reading errors, and reasoning errors. The authors conclude that future work should focus on improving LLMs along these identified dimensions and measuring the resulting impact on ReMI performance.
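To make the separate-vs.-together comparison concrete, here is a minimal sketch of two ways the same multi-image question could be packaged for a multimodal model; the `model.generate` call is a hypothetical placeholder and the side-by-side layout is an assumption for illustration, not the paper's actual evaluation code.

```python
from typing import List
from PIL import Image


def interleaved_prompt(question_parts: List[str], images: List[Image.Image]) -> list:
    """Keep each image as a separate input, interleaved with the text chunks
    that refer to it ("images provided separately")."""
    parts: list = []
    for text, img in zip(question_parts, images):
        parts.append(text)
        parts.append(img)
    return parts


def combined_prompt(question: str, images: List[Image.Image]) -> list:
    """Paste all images side by side onto a single canvas, so the model
    receives one image plus the full question text ("all together")."""
    width = sum(im.width for im in images)
    height = max(im.height for im in images)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return [question, canvas]


# `model.generate` stands in for whatever multimodal API is being evaluated;
# the comparison of interest is between the two prompt constructions above.
# answer_separate = model.generate(interleaved_prompt(parts, images))
# answer_combined = model.generate(combined_prompt(question, images))
```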