SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval


11 Jun 2024 | Siwei Wu, Yizhi Li, Kang Zhu, Ge Zhang, Yiming Liang, Kaijing Ma, Chenghao Xiao, Haoran Zhang, Bohao Yang, Wenhu Chen, Wenhao Huang, Noura Al Moubayed, Jie Fu, Chenghua Lin
SciMMIR is a scientific multi-modal information retrieval benchmark designed to evaluate how well models retrieve scientific information from image-text pairs. The benchmark is constructed by extracting image-text pairs from open-access research papers, resulting in 530,000 meticulously curated pairs. These pairs are annotated with a two-level subset-subcategory hierarchy to enable comprehensive evaluation of retrieval systems. The data focuses on scientific domains, where captions describe experimental results or scientific principles rather than general scenes or activities.

The benchmark evaluates models in both zero-shot and fine-tuned settings, using models such as CLIP, BLIP, and BLIP-2. Optical character recognition (OCR) is applied to images to enhance the performance of visual language models (VLMs) on the SciMMIR task. The findings show that domain-specific adaptation significantly improves performance, with BLIP-2 models generally outperforming other pre-trained VLMs. The results also highlight the importance of text encoders and the impact of OCR data on model performance. Overall, the benchmark provides insights into the challenges of multi-modal information retrieval in scientific domains and the effectiveness of domain adaptation. The dataset and code are publicly available for further research.
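To make the zero-shot setting concrete, the sketch below scores a scientific caption against a small pool of candidate figure images with CLIP and ranks them by similarity. The checkpoint name, placeholder file paths, and example caption are illustrative assumptions, not the paper's exact evaluation pipeline; SciMMIR's reported metrics are computed over its full retrieval pool.

```python
# Minimal sketch of zero-shot text->image retrieval scoring with CLIP,
# in the spirit of SciMMIR's zero-shot setting. Checkpoint, file paths,
# and the query caption are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Toy candidate pool: paper figures/tables rendered as images (hypothetical paths).
images = [Image.open(p).convert("RGB") for p in ["fig1.png", "fig2.png", "tab1.png"]]

# A scientific caption as the query (SciMMIR captions describe experimental
# results or scientific principles rather than everyday scenes).
query = "Figure 2: Ablation of the text encoder on retrieval accuracy."

with torch.no_grad():
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_text holds the similarity of the query caption to each candidate image.
    scores = outputs.logits_per_text[0]

# Rank candidates by similarity; top-k hits would feed metrics such as MRR or Hits@k.
ranking = scores.argsort(descending=True).tolist()
print("Ranked image indices:", ranking)
```

The same scoring loop can be run in the reverse direction (image as query, captions as candidates), and fine-tuned or OCR-augmented variants would only change how the inputs and model weights are prepared, not the ranking step itself.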