16 Jan 2024 | Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
This paper proposes GeMKR, a novel end-to-end generative framework for multi-modal knowledge retrieval that leverages large language models (LLMs) as virtual knowledge bases. The framework targets the weaknesses of existing retrievers in both effectiveness and training efficiency, especially when handling multi-modal queries. GeMKR works in two steps: it first generates knowledge clues related to the query, and then retrieves relevant documents by searching the database with those clues.

Architecturally, GeMKR introduces an object-aware prefix-tuning technique to guide multi-grained visual learning, aligning multi-grained visual features into the textual feature space of the LLM so that cross-modal interactions are captured. The authors also construct instruction data in a unified format for training, and propose a knowledge-guided generation strategy, a constrained beam search guided by the knowledge base, that imposes prior constraints at each decoding step and encourages the model to produce distinctive knowledge clues.

Experiments on three benchmarks show consistent improvements of 3.0% to 14.6% across all evaluation metrics over strong baselines. The model is trained end-to-end without additional data, retrieves precise knowledge from large-scale knowledge bases, and generalizes well to large-scale knowledge sources, while being more efficient than competing approaches.
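To make the cross-modal alignment concrete, the sketch below shows one common way an object-aware visual prefix could be realized: global image features and per-object region features are projected into the LLM's textual embedding space and prepended to the text embeddings as prefix tokens. This is an illustrative sketch rather than GeMKR's released implementation; the module name `VisualPrefix`, all feature dimensions, and the frozen-LLM setup are assumptions.

```python
# Minimal PyTorch sketch of the idea behind object-aware prefix tuning: coarse
# (whole-image) and fine-grained (object-region) features are projected into the
# LLM's embedding space and prepended as prefix tokens. Dimensions and names are
# illustrative, not GeMKR's actual configuration.

import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, vis_dim=1024, obj_dim=2048, llm_dim=4096):
        super().__init__()
        self.global_proj = nn.Linear(vis_dim, llm_dim)   # coarse-grained (whole image)
        self.object_proj = nn.Linear(obj_dim, llm_dim)   # fine-grained (detected objects)

    def forward(self, global_feat, object_feats, text_embeds):
        # global_feat: (B, vis_dim); object_feats: (B, N_obj, obj_dim)
        # text_embeds: (B, T, llm_dim) from the (frozen) LLM's token embedding layer
        g = self.global_proj(global_feat).unsqueeze(1)    # (B, 1, llm_dim)
        o = self.object_proj(object_feats)                # (B, N_obj, llm_dim)
        # Prefix order: [global token, object tokens, instruction/text tokens]
        return torch.cat([g, o, text_embeds], dim=1)

# Example with random tensors standing in for real features.
prefix = VisualPrefix()
fused = prefix(torch.randn(2, 1024), torch.randn(2, 5, 2048), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 22, 4096])
```

The fused sequence would then be fed to the LLM in place of plain text embeddings, so only the small projection layers (and any prefix parameters) need to be trained.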
The results indicate that the proposed framework is well-suited for multi-modal knowledge retrieval tasks.
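The knowledge-guided generation step can likewise be illustrated with a small, self-contained sketch: a trie built over tokenized knowledge-base entries restricts which tokens may be generated next, so every completed clue is guaranteed to map back to at least one document, which can then be fetched by prefix lookup. The `Trie` class, the toy scoring function, and the tiny corpus below are hypothetical stand-ins for the LLM and the real knowledge base.

```python
# Illustrative sketch (not the authors' code) of knowledge-guided, constrained
# clue generation followed by document lookup.

from collections import defaultdict

class Trie:
    """Prefix tree over tokenized knowledge clues; nodes store matching doc ids."""
    def __init__(self):
        self.children = defaultdict(Trie)
        self.doc_ids = []

    def insert(self, tokens, doc_id):
        node = self
        for tok in tokens:
            node = node.children[tok]
            node.doc_ids.append(doc_id)

    def allowed_next(self, prefix):
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        return list(node.children.keys())

    def lookup(self, prefix):
        node = self
        for tok in prefix:
            node = node.children[tok]
        return node.doc_ids


def generate_clue(score_fn, trie, max_len=8):
    """Greedy constrained decoding: at each step keep only tokens the trie allows."""
    clue = []
    for _ in range(max_len):
        candidates = trie.allowed_next(clue)
        if not candidates:          # reached a leaf: the clue is complete
            break
        clue.append(max(candidates, key=lambda tok: score_fn(clue, tok)))
    return clue


if __name__ == "__main__":
    trie = Trie()
    trie.insert(["eiffel", "tower", "paris"], doc_id=0)
    trie.insert(["eiffel", "tower", "height"], doc_id=1)
    trie.insert(["statue", "of", "liberty"], doc_id=2)

    # Stand-in for LLM next-token scores conditioned on the multi-modal query.
    toy_scores = {"eiffel": 0.9, "tower": 0.8, "height": 0.7, "paris": 0.2}
    clue = generate_clue(lambda prefix, tok: toy_scores.get(tok, 0.0), trie)
    print("clue:", clue, "-> docs:", trie.lookup(clue))
```

In the real system the per-token scores would come from the LLM conditioned on the multi-modal query, and decoding would use constrained beam search rather than the greedy loop shown here.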