Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

22 May 2024 | Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Wiki-LLaVA is a hierarchical retrieval-augmented generation approach designed to extend multimodal large language models (MLLMs) with external knowledge drawn from documents. A hierarchical retrieval pipeline fetches relevant passages from an external knowledge base, and these passages are supplied to the LLM as additional context, improving the accuracy and specificity of the generated answers.

The paper starts from the observation that MLLMs struggle with visual questions whose answers cannot be read off the image or recalled from the model's parameters, but instead require external, often fine-grained, knowledge. Wiki-LLaVA addresses this by attaching a hierarchical retrieval module to the MLLM: the module first identifies the most relevant documents in the knowledge base, then selects the most relevant passages within those documents. The retrieved passages are fed to the MLLM as additional context without altering its architecture, so the model can exploit external knowledge while otherwise behaving as a standard MLLM. The model is fine-tuned on pairs of questions and ground-truth answers that require external knowledge. A minimal sketch of this pipeline is given below.
Experiments on Encyclopedic-VQA and InfoSeek, two visual question answering datasets built specifically around external knowledge, show that performance improves significantly when the retrieved knowledge is used: the model locates relevant information in the knowledge base and produces more accurate answers, outperforming comparable MLLMs on several tasks. The paper also discusses the limitations of the approach and points to future directions, such as improving the accuracy of the hierarchical retrieval itself and developing more efficient and sustainable paradigms for document selection. Overall, the work presents a practical way to equip MLLMs with external knowledge, with the potential to improve their performance on knowledge-intensive tasks.