16 Jan 2024 | Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
This paper proposes GeMKR, a novel end-to-end generative framework for multi-modal knowledge retrieval that leverages large language models (LLMs) as virtual knowledge bases. The framework targets the weaknesses of existing retrievers in both effectiveness and training efficiency, especially when handling multi-modal queries. GeMKR works in two steps: it first generates knowledge clues related to the query, and then retrieves relevant documents by searching the database with those clues.

Architecturally, GeMKR introduces an object-aware prefix-tuning technique to guide multi-grained visual learning, aligning multi-grained visual features into the textual feature space of the LLM so that cross-modal interactions are captured. The authors also construct instruction data in a unified format for training, and propose a knowledge-guided generation strategy, a constrained beam search guided by the knowledge base, that imposes prior constraints at each decoding step and encourages the model to produce distinctive knowledge clues.

Experiments on three benchmarks show consistent improvements of 3.0% to 14.6% across all evaluation metrics over strong baselines. The model is trained end-to-end without additional data, retrieves precise knowledge from large-scale knowledge bases, and generalizes well to large-scale knowledge sources, while being more efficient than competing approaches.
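To make the cross-modal alignment concrete, the sketch below shows one common way an object-aware visual prefix could be realized: global image features and per-object region features are projected into the LLM's textual embedding space and prepended to the text embeddings as prefix tokens. This is an illustrative sketch rather than GeMKR's released implementation; the module name `VisualPrefix`, all feature dimensions, and the frozen-LLM setup are assumptions.

```python
# Minimal PyTorch sketch of the idea behind object-aware prefix tuning: coarse
# (whole-image) and fine-grained (object-region) features are projected into the
# LLM's embedding space and prepended as prefix tokens. Dimensions and names are
# illustrative, not GeMKR's actual configuration.

import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, vis_dim=1024, obj_dim=2048, llm_dim=4096):
        super().__init__()
        self.global_proj = nn.Linear(vis_dim, llm_dim)   # coarse-grained (whole image)
        self.object_proj = nn.Linear(obj_dim, llm_dim)   # fine-grained (detected objects)

    def forward(self, global_feat, object_feats, text_embeds):
        # global_feat: (B, vis_dim); object_feats: (B, N_obj, obj_dim)
        # text_embeds: (B, T, llm_dim) from the (frozen) LLM's token embedding layer
        g = self.global_proj(global_feat).unsqueeze(1)    # (B, 1, llm_dim)
        o = self.object_proj(object_feats)                # (B, N_obj, llm_dim)
        # Prefix order: [global token, object tokens, instruction/text tokens]
        return torch.cat([g, o, text_embeds], dim=1)

# Example with random tensors standing in for real features.
prefix = VisualPrefix()
fused = prefix(torch.randn(2, 1024), torch.randn(2, 5, 2048), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 22, 4096])
```

The fused sequence would then be fed to the LLM in place of plain text embeddings, so only the small projection layers (and any prefix parameters) need to be trained.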
The results indicate that the proposed framework is well-suited for multi-modal knowledge retrieval tasks.
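The knowledge-guided generation step can likewise be illustrated with a small, self-contained sketch: a trie built over tokenized knowledge-base entries restricts which tokens may be generated next, so every completed clue is guaranteed to map back to at least one document, which can then be fetched by prefix lookup. The `Trie` class, the toy scoring function, and the tiny corpus below are hypothetical stand-ins for the LLM and the real knowledge base.

```python
# Illustrative sketch (not the authors' code) of knowledge-guided, constrained
# clue generation followed by document lookup.

from collections import defaultdict

class Trie:
    """Prefix tree over tokenized knowledge clues; nodes store matching doc ids."""
    def __init__(self):
        self.children = defaultdict(Trie)
        self.doc_ids = []

    def insert(self, tokens, doc_id):
        node = self
        for tok in tokens:
            node = node.children[tok]
            node.doc_ids.append(doc_id)

    def allowed_next(self, prefix):
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        return list(node.children.keys())

    def lookup(self, prefix):
        node = self
        for tok in prefix:
            node = node.children[tok]
        return node.doc_ids


def generate_clue(score_fn, trie, max_len=8):
    """Greedy constrained decoding: at each step keep only tokens the trie allows."""
    clue = []
    for _ in range(max_len):
        candidates = trie.allowed_next(clue)
        if not candidates:          # reached a leaf: the clue is complete
            break
        clue.append(max(candidates, key=lambda tok: score_fn(clue, tok)))
    return clue


if __name__ == "__main__":
    trie = Trie()
    trie.insert(["eiffel", "tower", "paris"], doc_id=0)
    trie.insert(["eiffel", "tower", "height"], doc_id=1)
    trie.insert(["statue", "of", "liberty"], doc_id=2)

    # Stand-in for LLM next-token scores conditioned on the multi-modal query.
    toy_scores = {"eiffel": 0.9, "tower": 0.8, "height": 0.7, "paris": 0.2}
    clue = generate_clue(lambda prefix, tok: toy_scores.get(tok, 0.0), trie)
    print("clue:", clue, "-> docs:", trie.lookup(clue))
```

In the real system the per-token scores would come from the LLM conditioned on the multi-modal query, and decoding would use constrained beam search rather than the greedy loop shown here.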