11 Jun 2024 | Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang, Bohao Yang, Wenhua Chen, Wenhao Huang, Noura Al Moubayed, Jie Fu, Chenghua Lin
**Abstract:**
Multi-modal information retrieval (MMIR) has seen significant advancements, particularly for image-text pairs, but current benchmarks largely overlook the scientific domain, which has distinct characteristics. To address this gap, we introduce SciMMIR, a scientific-domain MMIR benchmark. SciMMIR leverages open-access research papers to extract 530K image-text pairs from figures and tables with detailed captions. These pairs are annotated with a two-level subset-subcategory hierarchy to facilitate fine-grained evaluation. We evaluate prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2, in both zero-shot and fine-tuned settings, and we explore whether optical character recognition (OCR) improves VLMs' performance. Our findings highlight the importance of domain-specific adaptation and the influence of different visual and textual encoders.
**Introduction:**
Information retrieval (IR) systems aim to surface relevant information from large collections in response to user queries. Advances in representation learning have shifted IR from lexical matching to similarity matching over learned representations, which also supports additional modalities such as images and audio (a minimal sketch of this retrieval setup follows below). Fine-grained multi-modal retrieval is particularly valuable in scientific domains, yet existing benchmarks focus mainly on generic topics and neglect scientific data, which has distinct characteristics. This paper introduces SciMMIR, the first benchmark for evaluating MMIR in the scientific domain. It comprises 530K image-text pairs from figures and tables, annotated with a two-level subset-subcategory hierarchy. We conduct extensive experiments on a range of models, revealing the challenges of scientific MMIR and where models improve.
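To make "similarity matching over learned representations" concrete, the sketch below ranks a corpus by cosine similarity between embedding vectors. The encoder outputs are stood in for by random vectors; nothing here is specific to SciMMIR or the models evaluated in the paper.

```python
# Minimal sketch of similarity-based retrieval over learned representations.
# The "encoder" outputs here are random placeholders, not real embeddings.
import numpy as np

def cosine_retrieve(query_vec: np.ndarray, corpus_vecs: np.ndarray, top_k: int = 5):
    """Rank corpus items by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per corpus item
    top = np.argsort(-scores)[:top_k]   # indices of the best-matching items
    return list(zip(top.tolist(), scores[top].tolist()))

# Example with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
corpus = rng.normal(size=(1000, 512))
print(cosine_retrieve(query, corpus, top_k=3))
```

In a multi-modal setting the same scoring applies, except the query and corpus embeddings come from different encoders (e.g., a text encoder for captions and an image encoder for figures) trained to share an embedding space.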
**Related Work:**
General information retrieval has evolved from lexical matching to learned representations, with recent advancements in multi-modal representation learning. Scientific document retrieval has received moderate attention, but existing benchmarks often lack comprehensive coverage and real-world data.
**Dataset Construction:**
We collect PDF files of open-access papers from arXiv and extract figures and tables together with their captions. The dataset is split into training, validation, and testing sets. We define a hierarchical structure with two subsets (figures and tables) and five subcategories (e.g., experimental results, model architectures), as illustrated by the sketch below.
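The following record layout illustrates how such a two-level annotation might be carried per image-text pair. It is a hypothetical sketch: the field names, paths, and values are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical record layout for a SciMMIR-style image-text pair.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class SciMMIRPair:
    image_path: str    # rendered figure or table extracted from the paper PDF
    caption: str       # caption text paired with the image
    subset: str        # first level: "figure" or "table"
    subcategory: str   # second level, e.g. "experimental result", "model architecture"
    split: str         # "train", "validation", or "test"

example = SciMMIRPair(
    image_path="images/2301.00001_fig3.png",
    caption="Figure 3: Overview of the proposed model architecture.",
    subset="figure",
    subcategory="model architecture",
    split="train",
)
```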
**Experiment:**
We evaluate image-captioning models and visual language models in both zero-shot and fine-tuned settings, and additionally test whether OCR-extracted text improves VLMs' performance. Our results show that domain-specific adaptation significantly improves model performance, especially on figure-related tasks; fine-tuned models outperform their zero-shot counterparts, and OCR enhances performance on table-related tasks.
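The snippet below shows what a zero-shot text-to-image retrieval pass with an off-the-shelf CLIP checkpoint (via Hugging Face `transformers`) can look like. It is a minimal sketch of the general setup, not the paper's evaluation code; the checkpoint name, file paths, and captions are placeholders.

```python
# Minimal zero-shot text-to-image retrieval sketch with an off-the-shelf CLIP
# checkpoint. Illustrative only: not the paper's evaluation pipeline, and the
# image paths/captions below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "Figure 2: Training loss curves for the baseline and our model.",
    "Table 1: Main experimental results on the benchmark.",
]
images = [Image.open(p) for p in ["fig2.png", "table1.png"]]  # placeholder paths

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text[i, j] scores caption i against image j; sorting each row
# yields the ranked images for that caption (forward, text-to-image retrieval).
ranks = outputs.logits_per_text.argsort(dim=-1, descending=True)
print(ranks)
```

Fine-tuning would update the same dual-encoder scoring on in-domain figure/table-caption pairs, which is where the domain-adaptation gains reported above come from.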
**Conclusion:**
SciMMIR addresses the gap in evaluating MMIR in the scientific domain. Our findings highlight the importance of domain-specific adaptation and the impact of different encoders. The dataset and code are publicly available for further research.