5 Jun 2024 | Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
**PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers**
Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
Department of Engineering
University of Cambridge
Cambridge, United Kingdom CB2 1PZ
{wl356, jm2245, jc2124, wjb31}@cam.ac.uk
**Abstract**
Large Multimodal Models (LMMs) excel at natural language and visual understanding but struggle with tasks such as Knowledge-based Visual Question Answering (KB-VQA), which require retrieving relevant information from document collections. We present M2KR, an extensive training and evaluation framework for KB-VQA, which includes a suite of vision and language tasks. Using M2KR, we develop PreFLMR, a pre-trained version of the Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, achieving state-of-the-art results across various tasks. We also investigate the scaling behaviors of PreFLMR, providing insights for future developments in general-purpose multi-modal retrievers. The code, demo, dataset, and pre-trained checkpoints are available at <https://preflmr.github.io/>.
KB-VQA systems generate answers to queries about images by combining relevant world knowledge with vision and language understanding. Despite their strengths in vision and language, LMMs often perform poorly on KB-VQA tasks. Retrieval-Augmented Generation (RAG) improves performance by grounding answer generation in relevant documents retrieved from a knowledge base. FLMR, a fine-grained late-interaction multi-modal retrieval approach, has shown superior performance over Dense Passage Retrieval (DPR) on a range of KB-VQA tasks (a brief scoring sketch follows the list below). This paper investigates three aspects of FLMR: vision and text encoding, pre-training, and task diversity.
- **Vision & Text Encoding**: We explore how KB-VQA performance is affected by scaling the size and complexity of vision and text encoders.
- **Pre-training**: We investigate whether gains can be achieved through more extensive model pre-training.
- **Task Diversity**: We gather nine open-source vision-language datasets into the M2KR benchmark suite for assessing multi-task multi-modal knowledge retrieval.
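For readers unfamiliar with the distinction referenced above, the sketch below contrasts DPR-style single-vector scoring with the token-level late-interaction (MaxSim) scoring used by FLMR-style retrievers. The tensor shapes, embedding dimension, and function names are illustrative assumptions, not code from the PreFLMR release.

```python
import torch

def dpr_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """DPR-style relevance: a single dot product between pooled query and document embeddings."""
    return query_vec @ doc_vec  # scalar score

def late_interaction_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """FLMR/ColBERT-style MaxSim relevance: each query token embedding (textual or visual)
    is matched to its most similar document token embedding, and the per-token maxima
    are summed into one relevance score."""
    sim = query_tokens @ doc_tokens.T   # (num_query_tokens, num_doc_tokens) similarity matrix
    return sim.max(dim=1).values.sum()  # fine-grained, token-level matching

# Toy usage with random embeddings (128-dim chosen arbitrarily for illustration).
q_tok, d_tok = torch.randn(40, 128), torch.randn(300, 128)
score = late_interaction_score(q_tok, d_tok)
```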
We show that M2KR can be used to train an FLMR-based retriever for multi-task multi-modal retrieval, which we refer to as PreFLMR. PreFLMR performs well across a range of knowledge retrieval tasks when given appropriate instructions. We release PreFLMR upon publication.
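As a hedged illustration of instruction-conditioned retrieval, the snippet below shows how a task-specific instruction can be prepended to the textual query so that a single retriever serves multiple M2KR-style tasks. The instruction strings and the `build_query` helper are hypothetical placeholders, not the exact prompts or API shipped with PreFLMR.

```python
def build_query(instruction: str, question: str = "") -> str:
    """Prepend a task-specific instruction to the (possibly empty) question text.
    The paired image is handled separately by the vision encoder."""
    return f"{instruction} {question}".strip()

# Hypothetical instructions for two different retrieval tasks.
kb_vqa_query = build_query(
    "Using the provided image, retrieve documents that answer the question:",
    "In which century was this cathedral built?",
)
captioning_query = build_query(
    "Retrieve a passage that describes the content of the provided image."
)
```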
Our contributions include:
- The M2KR task suite, encompassing nine datasets and three types of retrieval tasks.
- PreFLMR, a strong multi-modal retriever pre-trained on a vision-language corpus of over ten million items.
- A study of the scaling behavior of FLMR in terms of model parameters and training data, providing empirical guidance for future work.
**Document Retrieval**: DPR has become a cornerstone in knowledge-intensive tasks,