SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

14 May 2024 | Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie
SciFIBench is a benchmark for evaluating how well large multimodal models (LMMs) interpret scientific figures. It comprises 1,000 multiple-choice questions across 12 categories, curated from arXiv paper figures and captions using adversarial filtering to select hard negatives and human verification to keep the questions high-quality and answerable. Alongside this gold evaluation set, the benchmark provides silver (10,000 questions) and bronze (174,000 questions) subsets intended for downstream uses such as hyperparameter tuning and fine-tuning. Questions cover two core tasks: Figure → Caption, where the model selects the correct caption for a given figure, and Caption → Figure, where it selects the correct figure for a given caption. Evaluating 26 LMMs, both closed- and open-source, shows that closed-source models outperform open-source ones; GPT-4o and Gemini-Pro 1.5 perform best, beating all vision-language model (VLM) baselines but falling short of the human baseline.
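To make the curation step concrete, the sketch below shows one way a Figure → Caption question could be assembled with adversarially filtered negatives: distractor captions are chosen as the nearest neighbours of the true caption in an embedding space, so the wrong options are hard to tell apart from the right one. This is a minimal illustration under assumed details; the embedding model, the number of options, and the helper names are not taken from SciFIBench itself, and in the benchmark such candidates are additionally screened by human verification.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # illustrative encoder, not the paper's choice


def build_figure_to_caption_question(true_caption, caption_pool, n_options=5, seed=0):
    """Build one multiple-choice question: the true caption plus hard negatives.

    Hard negatives are the pool captions most similar to the true caption in
    embedding space -- a simple form of adversarial filtering.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
    emb = model.encode([true_caption] + list(caption_pool), normalize_embeddings=True)
    true_emb, pool_emb = emb[0], emb[1:]

    # Cosine similarity of every pool caption to the true caption (embeddings are unit-norm).
    sims = pool_emb @ true_emb
    hardest = np.argsort(-sims)[: n_options - 1]  # most similar captions = hardest distractors

    options = [true_caption] + [caption_pool[i] for i in hardest]
    order = np.random.default_rng(seed).permutation(len(options))
    answer_idx = int(np.where(order == 0)[0][0])  # where the true caption lands after shuffling
    return [options[i] for i in order], answer_idx
```

The Caption → Figure task is the mirror image, with distractor figures presumably selected in an analogous way.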
Adversarial filtering increases question difficulty, but human verification remains essential for ensuring questions are high-quality and answerable. Using a strong LLM to judge the outputs of the evaluated LMMs proves accurate, making automatic evaluation viable at scale. The benchmark also probes the alignment and reasoning faithfulness of LMMs on augmented question sets. The best-performing models achieve high accuracy on both tasks, with Figure → Caption proving slightly easier than Caption → Figure, and performance varies considerably across models and tasks. Overall, there is substantial room for improvement in LMMs' ability to interpret scientific figures. By re-framing the task in a multiple-choice setting better suited to robust evaluation, SciFIBench offers a quantitative assessment of scientific figure understanding that had not previously been reported; it is released to encourage progress in this domain.
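As a rough illustration of the LLM-based automatic evaluation, the sketch below asks a judge model to map a free-form answer onto one of the lettered options and then computes accuracy. `call_judge_llm` is a hypothetical stand-in for whichever strong LLM API is used, and the prompt wording is an assumption rather than the benchmark's actual template.

```python
import re
import string


def call_judge_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a strong judge LLM."""
    raise NotImplementedError("plug in your LLM client here")


def judge_answer(model_output: str, options: list[str], correct_idx: int) -> bool:
    """Ask the judge LLM which option a free-form answer corresponds to."""
    letters = string.ascii_uppercase[: len(options)]
    listing = "\n".join(f"{letter}) {opt}" for letter, opt in zip(letters, options))
    prompt = (
        "A model was asked a multiple-choice question with these options:\n"
        f"{listing}\n\n"
        f"The model answered:\n{model_output}\n\n"
        "Reply with the single letter of the option that best matches the answer."
    )
    reply = call_judge_llm(prompt)
    match = re.search(f"[{letters}]", reply.upper())
    return match is not None and match.group(0) == letters[correct_idx]


def accuracy(records) -> float:
    """records: iterable of (model_output, options, correct_idx) tuples."""
    results = [judge_answer(out, opts, idx) for out, opts, idx in records]
    return sum(results) / len(results)
```

The appeal of this setup is that the judge only has to match free text to a fixed option list, a far more constrained task than open-ended grading, which is consistent with the finding above that LLM-based evaluation is accurate enough to be viable.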