Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

2 Jun 2024 | Lei Li†, Yuqi Wang†‡, Runxin Xu†, Peiyi Wang†, Xiachong Feng†, Lingpeng Kong†, Qi Liu†
The paper introduces Multimodal ArXiv, a dataset designed to enhance the scientific comprehension of large vision-language models (LVLMs). The dataset has two parts: ArXivCap and ArXivQA. ArXivCap is a figure-caption dataset containing 6.4 million images and 3.9 million captions drawn from 572,000 ArXiv papers across a range of scientific domains. ArXivQA is a question-answering dataset generated by prompting GPT-4V with scientific figures, aimed at strengthening LVLMs' mathematical reasoning capabilities. Experiments show that training on the dataset yields a 10.4% absolute accuracy gain on the MathVista benchmark. In addition, the paper defines four vision-to-text tasks over ArXivCap to benchmark LVLMs' ability to understand scientific figures. The results highlight how much LVLMs struggle with abstract figures and how effective domain-specific training is at improving performance. An error analysis identifies common failure modes, including misinterpretation of visual context, recognition errors, and overly simplified captions, offering guidance for future improvements.
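To make the ArXivQA construction step concrete, the sketch below shows how one might prompt a GPT-4V-class model with a single scientific figure to produce a multiple-choice QA pair via the OpenAI API. This is an illustrative assumption, not the authors' released pipeline: the prompt wording, the model name (`gpt-4o` as a stand-in for GPT-4V), and the output format are placeholders chosen for demonstration.

```python
"""Minimal sketch (assumptions noted above): generate one multiple-choice
QA pair for a scientific figure by prompting a GPT-4V-class model."""
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the paper's actual prompt may differ.
PROMPT = (
    "You are given a figure from a scientific paper. Write one multiple-choice "
    "question that tests understanding of the figure, list four options labeled "
    "A-D, state the correct option, and add a short rationale."
)

def generate_qa(figure_path: str) -> str:
    """Return a model-generated QA pair for a single figure image."""
    with open(figure_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for the GPT-4V endpoint used in the paper
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_qa("figure.png"))
```

Applied over a large pool of figures, a loop of this kind would yield figure-grounded QA pairs of the sort ArXivQA collects; the paper's own generation and filtering details should be taken from the original source.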