22 May 2024 | Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
**Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs**
This paper introduces Wiki-LLaVA, a novel approach to enhance Multimodal Large Language Models (MLLMs) by integrating external knowledge from multimodal documents. The primary goal is to enable MLLMs to answer questions that require external knowledge, which is often challenging for models relying solely on their internal parameters and pre-trained knowledge.
**Key Contributions:**
1. **Hierarchical Retrieval Pipeline:** Wiki-LLaVA employs a hierarchical retrieval pipeline to identify relevant passages from an external knowledge base, such as Wikipedia, which are then used as additional context for the MLLM.
2. **Model Architecture:** The proposed model, Wiki-LLaVA, integrates this retrieval module without significantly altering the structure of the MLLM, enhancing its ability to generate more precise and contextually rich responses.
3. **Evaluation:** Extensive experiments on datasets like Encyclopedic-VQA and InfoSeek demonstrate the effectiveness of Wiki-LLaVA in improving the accuracy and precision of answers, especially for questions requiring external knowledge.
**Related Work:**
- **Multimodal LLMs:** Recent advancements in MLLMs have shown significant improvements in handling visual and textual inputs, but they still struggle with questions that require fine-grained, entity-specific knowledge or compositional reasoning.
- **Retrieval-Augmented Language Models:** Techniques like retrieval-augmentation have been applied to expand the input space of language models, improving performance in knowledge-intensive tasks.
- **Knowledge-Based Visual Question Answering:** New benchmarks like Encyclopedic-VQA and InfoSeek have raised the bar for visual question answering, requiring models to retrieve information from external sources.
**Methodology:**
- **Knowledge-Based Augmentation:** Wiki-LLaVA enriches the MLLM's input context with relevant passages retrieved from an external memory, conditioning the model's output distribution on this retrieved text in addition to the image and the question.
- **Hierarchical Retrieval:** Retrieval proceeds in two steps: the most relevant documents are first retrieved from the external knowledge base using the query image, and the most relevant passages within those documents are then selected using the textual question (a minimal sketch of this two-stage process is shown after this list).
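A minimal sketch of this hierarchical retrieval is shown below. It assumes placeholder encoders `embed_image` and `embed_text` and a toy in-memory knowledge base; the actual system relies on pretrained vision-language and text retrieval models over a Wikipedia-scale index, so this is illustrative rather than the paper's implementation.

```python
import numpy as np

def embed_image(image_path: str) -> np.ndarray:
    """Placeholder visual encoder (stands in for a CLIP-style model);
    returns a deterministic pseudo-random unit vector."""
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    """Placeholder textual encoder (stands in for a real text retriever)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

# Toy external knowledge base: each document has an embedding used for
# image-to-document matching plus a list of textual passages.
knowledge_base = [
    {
        "title": "Eiffel Tower",
        "doc_embedding": embed_text("Eiffel Tower"),
        "passages": [
            "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "It was completed in 1889 as the entrance arch to the World's Fair.",
        ],
    },
    # ... more documents ...
]

def hierarchical_retrieve(image_path: str, question: str,
                          n_docs: int = 1, k_passages: int = 2) -> list[str]:
    # Stage 1: rank documents against the query image embedding.
    img_emb = embed_image(image_path)
    docs = sorted(knowledge_base,
                  key=lambda d: float(img_emb @ d["doc_embedding"]),
                  reverse=True)[:n_docs]

    # Stage 2: rank passages inside the retrieved documents against the question.
    q_emb = embed_text(question)
    scored = [(float(q_emb @ embed_text(p)), p)
              for d in docs for p in d["passages"]]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:k_passages]]
```

The key design point is the decoupling of the two stages: the image decides which documents to look at, while the question decides which passages inside them are worth injecting into the context.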
**Experiments:**
- **Datasets:** The experiments are conducted on Encyclopedic-VQA and InfoSeek, two large-scale benchmarks of image-question pairs with associated answers that require external knowledge.
- **Evaluation:** Results show that Wiki-LLaVA significantly improves answer accuracy, especially when multiple retrieved textual chunks are used as additional context (a sketch of how such chunks can be assembled into the prompt follows this list). The model also largely retains the proficiency of the original MLLM on tasks that do not require external knowledge.
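To make the "multiple textual chunks as additional context" setting concrete, the snippet below sketches one plausible way to prepend retrieved passages to the question before the text is passed to the MLLM. The prompt template is an assumption made for illustration, not the exact template used in the paper.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the textual side of the MLLM input: retrieved passages
    first, then the user question. The image is passed to the model
    separately through its visual encoder."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Use the following passages to answer the question.\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks, e.g. the output of the retrieval stage sketched earlier.
chunks = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "It was completed in 1889 as the entrance arch to the World's Fair.",
]
print(build_prompt("When was this monument completed?", chunks))
```

Varying the number of chunks trades off recall of the relevant fact against the amount of potentially distracting text placed in the context window.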
**Limitations and Future Work:**
- The study highlights the need for further research in defining proper embedding spaces for document retrieval and improving the model's ability to select relevant documents and passages.
Overall, Wiki-LLaVA represents a significant step towards enhancing MLLMs with external knowledge, demonstrating the potential of retrieval-augmented approaches in improving the effectiveness and precision of multimodal models.