Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

2 Jun 2024 | Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu
Multimodal ArXiv is a dataset designed to enhance the scientific comprehension of large vision-language models (LVLMs). It consists of two components: ArXivCap, a figure-caption dataset containing 6.4 million images and 3.9 million captions drawn from 572,000 ArXiv papers across diverse scientific domains, and ArXivQA, a question-answering dataset generated by prompting GPT-4V with scientific figures.

Fine-tuning on ArXivQA significantly improves the mathematical reasoning capabilities of open-source LVLMs, yielding a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. The dataset also defines four vision-to-text tasks for benchmarking LVLMs, and evaluation results show that domain-specific training on ArXivCap leads to substantial performance improvements. Error analysis reveals that LVLMs still struggle with the nuanced semantics of academic figures, producing misinterpretations of visual context, recognition errors, and overly simplified captions. These findings underscore the importance of domain-specific training and of supplying richer contextual information to improve LVLM performance.

The dataset is a valuable resource for improving and benchmarking LVLMs on scientific reasoning tasks. The authors also note its limitations: it may overlook disciplines and data modalities present in the broader scientific literature, and future work could incorporate a wider range of sources and domains to enrich the coverage of scientific knowledge. The study concludes that the proposed dataset effectively enhances the scientific comprehension of LVLMs, particularly in mathematical reasoning.
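As a rough illustration of how ArXivQA-style items can be produced, the sketch below prompts a GPT-4V-class model with a scientific figure and asks it to write a multiple-choice question with a rationale. It is a minimal sketch assuming the OpenAI Python SDK; the model name, prompt wording, and output format are assumptions, not the authors' actual generation pipeline.

```python
# Hypothetical sketch of an ArXivQA-style generation step: send a figure to a
# vision-capable model and request one multiple-choice QA item. The prompt and
# model choice here are illustrative assumptions, not the paper's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa(figure_path: str) -> str:
    """Ask a vision-capable model to write one QA item about a figure."""
    with open(figure_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for GPT-4V
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Based on this scientific figure, write one "
                            "multiple-choice question with four options, "
                            "mark the correct answer, and give a short rationale."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_qa("figure_1.png"))  # hypothetical example figure
```

The returned QA items would then be filtered and reformatted before being used for instruction tuning, which is the general recipe the paper describes for building ArXivQA.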