SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

2024 | Shraman Pramanick, Rama Chellappa, Subhashini Venugopalan
SPIQA (Scientific Paper Image Question Answering) is a large-scale dataset designed to test the interpretation of complex figures and tables within the context of scientific research articles across a range of computer science domains. The dataset comprises 270K questions divided into a training split, a validation split, and three evaluation splits. Its construction leverages the ability of multimodal large language models (LLMs) to understand figures and tables, combined with both automatic and manual curation. SPIQA defines three tasks: direct QA with figures and tables, direct QA with the full paper, and chain-of-thought (CoT) QA, which assesses a model's ability to reason step by step and integrate information from multiple figures, tables, and text. Twelve prominent foundation models are evaluated on the dataset, and a novel evaluation metric, LLMLogScore (L3Score), is proposed to assess answer quality. The results show that fine-tuning on SPIQA significantly improves model performance, highlighting the potential for developing specialized systems for scientific QA.
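To make the L3Score idea concrete, here is a minimal sketch of how such a log-probability-based judge metric can be computed. The premise (assumed here, not spelled out in this summary) is that a judge LLM is asked whether a candidate answer matches the ground truth, and the score is the probability mass the judge assigns to "Yes", normalized against "Yes" plus "No". The function name and exact normalization are illustrative assumptions:

```python
import math


def l3_score(logprob_yes: float, logprob_no: float) -> float:
    """Confidence-weighted answer-quality score from a judge LLM.

    Hypothetical sketch: given the judge model's log-probabilities for
    answering "Yes" (candidate matches ground truth) and "No" (it does
    not), return the probability of "Yes" renormalized over the two
    options, yielding a soft score in (0, 1) instead of a hard 0/1 match.
    """
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)
```

Compared with exact-match or n-gram metrics, a score derived from the judge's token probabilities rewards confidently correct answers and penalizes confidently wrong ones, rather than treating all verdicts as equally certain.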