5 Jun 2024 | Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
**PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers**
Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
Department of Engineering
University of Cambridge
Cambridge, United Kingdom CB2 1PZ
{wl356, jm2245, jc2124, wjb31}@cam.ac.uk
**Abstract**
Large Multimodal Models (LMMs) excel at natural language and visual understanding but struggle with tasks such as Knowledge-based Visual Question Answering (KB-VQA), which require retrieving relevant information from document collections. We present M2KR, an extensive training and evaluation framework for KB-VQA, which includes a suite of vision and language tasks. Using M2KR, we develop PreFLMR, a pre-trained version of the Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, achieving state-of-the-art results across various tasks. We also investigate the scaling behaviors of PreFLMR, providing insights for future developments in general-purpose multi-modal retrievers. The code, demo, dataset, and pre-trained checkpoints are available at <https://preflmr.github.io/>.
KB-VQA systems generate answers to queries about images by combining relevant world knowledge with vision and language understanding. Despite their strengths in vision and language, LMMs often perform poorly on KB-VQA tasks. Retrieval-Augmented Generation (RAG) improves performance by grounding answer generation in relevant documents retrieved from a knowledge base. FLMR, a fine-grained late-interaction multi-modal retrieval approach, has shown superior performance over Dense Passage Retrieval (DPR) on a range of KB-VQA tasks (a brief scoring sketch follows the list below). This paper investigates three aspects of FLMR: vision and text encoding, pre-training, and task diversity.
- **Vision & Text Encoding**: We explore how KB-VQA performance is affected by scaling the size and complexity of vision and text encoders.
- **Pre-training**: We investigate whether gains can be achieved through more extensive model pre-training.
- **Task Diversity**: We gather nine open-source vision-language datasets into the M2KR benchmark suite for assessing multi-task multi-modal knowledge retrieval.
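For readers unfamiliar with the distinction referenced above, the sketch below contrasts DPR-style single-vector scoring with the token-level late-interaction (MaxSim) scoring used by FLMR-style retrievers. The tensor shapes, embedding dimension, and function names are illustrative assumptions, not code from the PreFLMR release.

```python
import torch

def dpr_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """DPR-style relevance: a single dot product between pooled query and document embeddings."""
    return query_vec @ doc_vec  # scalar score

def late_interaction_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """FLMR/ColBERT-style MaxSim relevance: each query token embedding (textual or visual)
    is matched to its most similar document token embedding, and the per-token maxima
    are summed into one relevance score."""
    sim = query_tokens @ doc_tokens.T   # (num_query_tokens, num_doc_tokens) similarity matrix
    return sim.max(dim=1).values.sum()  # fine-grained, token-level matching

# Toy usage with random embeddings (128-dim chosen arbitrarily for illustration).
q_tok, d_tok = torch.randn(40, 128), torch.randn(300, 128)
score = late_interaction_score(q_tok, d_tok)
```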
We show that M2KR can be used to train an FLMR-based retriever for multi-task multi-modal retrieval, which we refer to as PreFLMR. PreFLMR performs well across a range of knowledge retrieval tasks when given appropriate instructions. We release PreFLMR upon publication.
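As a hedged illustration of instruction-conditioned retrieval, the snippet below shows how a task-specific instruction can be prepended to the textual query so that a single retriever serves multiple M2KR-style tasks. The instruction strings and the `build_query` helper are hypothetical placeholders, not the exact prompts or API shipped with PreFLMR.

```python
def build_query(instruction: str, question: str = "") -> str:
    """Prepend a task-specific instruction to the (possibly empty) question text.
    The paired image is handled separately by the vision encoder."""
    return f"{instruction} {question}".strip()

# Hypothetical instructions for two different retrieval tasks.
kb_vqa_query = build_query(
    "Using the provided image, retrieve documents that answer the question:",
    "In which century was this cathedral built?",
)
captioning_query = build_query(
    "Retrieve a passage that describes the content of the provided image."
)
```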
Our contributions include:
- The M2KR task suite, encompassing nine datasets and three types of retrieval tasks.
- PreFLMR, a strong multi-modal retriever pre-trained on a vision-language corpus of over ten million items.
- A study of the scaling behavior of FLMR in terms of model parameters and training data, providing empirical guidance for future work.
**Document Retrieval**: DPR has become a cornerstone in knowledge-intensive tasks,