PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

5 Jun 2024 | Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
This paper presents PreFLMR, a pretrained version of the Fine-Grained Late-Interaction Multi-modal Retriever (FLMR) approach for Knowledge-based Visual Question Answering (KB-VQA). We introduce M2KR, a benchmark suite for training and evaluating general-purpose multi-modal retrievers. M2KR includes nine datasets and three types of retrieval tasks. PreFLMR is pretrained on a vision-language corpus of over ten million items and performs well across a range of knowledge retrieval tasks when given appropriate instructions. We investigate the scaling behaviors of PreFLMR, including vision and text encoding, pre-training, and task diversity. Our results show that PreFLMR achieves substantial gains across the M2KR tasks. We also evaluate PreFLMR on downstream KB-VQA tasks and find that it improves performance by approximately 6% on OKVQA, 9% on Infoseek, and 34% on E-VQA compared to models without retrieval. PreFLMR is also effective in document retrieval for KB-VQA tasks. We contribute a comprehensive training and evaluation framework, M2KR, for general-purpose multi-modal knowledge retrieval. The PreFLMR system we train in the M2KR framework yields excellent retrieval performance across a range of tasks and can also serve as a base for further task-specific fine-tuning.
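
To make the "late-interaction" idea in the title concrete, the sketch below shows the general ColBERT-style MaxSim scoring that FLMR-style retrievers build on: every query token (textual or visual) is matched against its most similar document token, and the per-token maxima are summed into a relevance score. This is an illustrative sketch only; the tensor shapes, the `late_interaction_score` function name, and the use of PyTorch are assumptions for demonstration, not the authors' released implementation.

```python
# Minimal sketch of late-interaction (MaxSim) relevance scoring, the general
# mechanism behind FLMR/PreFLMR-style retrievers. Shapes and names are
# illustrative assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F


def late_interaction_score(query_tokens: torch.Tensor,
                           doc_tokens: torch.Tensor) -> torch.Tensor:
    """Score one query against one document with MaxSim.

    query_tokens: (Nq, d) per-token query embeddings (e.g. question + image tokens)
    doc_tokens:   (Nd, d) per-token document embeddings
    Returns a scalar relevance score.
    """
    # Normalise so the dot product is cosine similarity.
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    # (Nq, Nd) token-level similarity matrix: each query token keeps its
    # best-matching document token, and the maxima are summed.
    sim = q @ d.T
    return sim.max(dim=-1).values.sum()


if __name__ == "__main__":
    # Toy usage: rank two candidate documents for one multi-modal query.
    torch.manual_seed(0)
    query = torch.randn(32, 128)                     # question + visual tokens
    docs = [torch.randn(180, 128), torch.randn(220, 128)]
    scores = [late_interaction_score(query, doc) for doc in docs]
    best = max(range(len(docs)), key=lambda i: float(scores[i]))
    print(f"scores={[round(float(s), 3) for s in scores]}, best doc = {best}")
```

Compared with single-vector dense retrieval, this token-level matching preserves fine-grained correspondences (e.g. a named entity in the question matching a phrase in the document), which is the property the "fine-grained" in FLMR refers to.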