14 May 2024 | Jonathan Roberts, Kai Han, Neil Houlsby, Samuel Albanie
The paper introduces SciFIBench, a benchmark for evaluating large multimodal models (LMMs) on scientific figure interpretation. The benchmark comprises 1,000 multiple-choice questions split between two tasks across 12 categories, curated from arXiv paper figures and captions. The questions are designed to be challenging: adversarial filtering is used to select hard negatives, and human verification ensures quality. The authors evaluate 26 LMMs on SciFIBench and find it to be a difficult benchmark. They also investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets. Key findings include the strong performance of closed-source models, the difficulty posed by adversarially selected negatives, and the performance gap between the two tasks. The paper also discusses the robustness of the benchmark and the importance of instruction-following ability in LMMs, and concludes by releasing SciFIBench to encourage further research in this domain.
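To make the adversarial-filtering idea concrete, the sketch below shows one plausible way to mine hard negatives for a figure-to-caption question: rank all candidate captions by text-embedding similarity to the gold caption and keep the closest ones as distractors. This is a minimal illustration, not the authors' exact pipeline; the `mine_hard_negatives` helper and the `all-MiniLM-L6-v2` encoder are stand-in assumptions.

```python
# Hypothetical hard-negative mining by embedding similarity (illustrative only;
# the paper's actual filtering setup and encoder may differ).
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(captions, true_idx, model, k=4):
    """Return indices of the k captions most similar to the gold caption,
    excluding the gold caption itself."""
    emb = model.encode(captions, normalize_embeddings=True)  # unit-norm text embeddings
    sims = emb @ emb[true_idx]       # cosine similarity to the gold caption
    sims[true_idx] = -np.inf         # never select the gold caption as a negative
    return np.argsort(sims)[::-1][:k].tolist()

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in text encoder
    captions = [
        "Accuracy of the proposed model versus the baseline on CIFAR-10.",
        "Top-1 accuracy of the proposed model versus the baseline on CIFAR-100.",
        "Training loss curves for three learning-rate schedules.",
        "Architecture diagram of the two-stream encoder.",
        "Ablation of the proposed model's accuracy on CIFAR-10.",
    ]
    negatives = mine_hard_negatives(captions, true_idx=0, model=model, k=4)
    print("Hard-negative caption indices:", negatives)
```

The point of such a step is that distractors drawn from semantically close captions force the model to read fine-grained details of the figure, rather than rely on superficial topic cues.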