SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

12 Jul 2024 | Shraman Pramanick, Rama Chellappa, Subhashini Venugopalan
SPIQA is a large-scale dataset for multimodal question answering on scientific papers, with a focus on interpreting complex figures and tables. It contains 270,000 questions across training, validation, and three evaluation splits, spanning a wide range of scientific domains. The dataset was built through a combination of automatic and manual curation, and its questions require reasoning over multiple figures, tables, and passages of text, probing how well multimodal systems understand the nuanced details of research articles.

The benchmark defines three tasks: direct QA with figures and tables, direct QA with the full paper, and Chain-of-Thought (CoT) QA. By including questions that demand simultaneous reasoning over figures, tables, and text, SPIQA bridges a gap left by earlier datasets, enabling a more integrated understanding of scientific documents. Additional evaluation sets with human-written questions further enhance its utility for benchmarking.

The paper also introduces a novel evaluation metric, LLMLogScore (L3Score), which assesses the semantic equivalence of a candidate answer to the ground truth using the log-likelihood probabilities of an LLM judge. Experiments show that L3Score assigns higher scores to better free-form answers than traditional metrics do, and that fine-tuning models on SPIQA improves their performance, highlighting the dataset's potential for advancing scientific QA systems. Together, the dataset and metric support the development of more accurate and robust systems for scientific-literature understanding.
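To make the metric concrete, the sketch below illustrates the core idea behind L3Score: prompt a judge LLM to answer "Yes" or "No" on whether the candidate answer matches the ground truth, then read the probability mass on those two tokens from the first generated token's log-probs. This is a minimal illustration assuming an OpenAI-style judge API; the exact prompt wording and the handling of cases where "Yes"/"No" fall outside the top-k alternatives are simplifications here, not the authors' exact implementation.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judge prompt; the paper's exact wording may differ.
JUDGE_PROMPT = (
    "You are given a question, a ground-truth answer, and a candidate answer.\n"
    "Question: {question}\n"
    "Ground-truth answer: {gt}\n"
    "Candidate answer: {cand}\n"
    "Is the candidate answer semantically equivalent to the ground truth? "
    "Reply with Yes or No only."
)

def l3score(question: str, gt: str, cand: str, model: str = "gpt-4o") -> float:
    """Sketch of L3Score: probability mass the judge puts on 'Yes',
    normalized against 'Yes' + 'No', read from the first output token."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gt=gt, cand=cand)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,  # inspect the top-5 alternatives for the first token
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    p_yes = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")
    p_no = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "no")
    if p_yes + p_no == 0.0:
        # Neither token appeared in the top-5; the full metric bounds
        # the missing probability more carefully than this fallback.
        return 0.0
    return p_yes / (p_yes + p_no)
```

Because the score is p(yes) / (p(yes) + p(no)) rather than a hard accept/reject, L3Score yields a soft value in [0, 1], which is what lets it rank partially correct free-form answers more gracefully than exact-match or n-gram metrics.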