30 Jun 2024 | Ziyan Jiang, Xueguang Ma, Wenhu Chen
**LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs**
**Abstract:**
Traditional RAG frameworks often use short retrieval units, which can lead to an imbalanced design where the retriever has a heavy workload while the reader has a lighter task. To address this issue, we propose LongRAG, a new framework that processes the entire Wikipedia into 4K-token units, significantly reducing the corpus size and improving retrieval performance. By feeding the top-k retrieved units to a long-context LLM, LongRAG achieves strong performance on open-domain question-answering tasks like NQ and HotpotQA. Our study offers insights into combining RAG with long-context LLMs.
**Introduction:**
Retrieval-Augmented Generation (RAG) methods enhance large language models (LLMs) by leveraging external corpora. Traditional RAG frameworks use short retrieval units, which can lead to semantic incompleteness and a heavy burden on the retriever. LongRAG addresses this by using retrieval units roughly 30x longer than in traditional RAG, paired with a long retriever and a long reader. The long retriever identifies coarsely relevant information from the corpus, while the long reader extracts answers from the retrieved units using a long-context LLM.
**Key Contributions:**
1. **Long Retrieval Unit:** Constructing long retrieval units from Wikipedia documents or grouped related documents.
2. **Long Retriever:** Identifies coarse relevant information from the corpus.
3. **Long Reader:** Extracts answers from the concatenated retrieved units using a long-context LLM.
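The three components above can be sketched end to end. This is an illustrative toy, not the authors' implementation: token counts are approximated by whitespace words, a simple word-overlap score stands in for the dense retriever, and the final LLM call is omitted, with only the reader prompt being assembled.

```python
GROUP_BUDGET = 4096  # target size of a long retrieval unit, in tokens


def group_documents(docs, budget=GROUP_BUDGET):
    """Long Retrieval Unit: greedily pack documents into ~budget-token units."""
    units, current, size = [], [], 0
    for doc in docs:
        n = len(doc.split())  # crude token count for illustration
        if current and size + n > budget:
            units.append(" ".join(current))
            current, size = [], 0
        current.append(doc)
        size += n
    if current:
        units.append(" ".join(current))
    return units


def score(query, unit):
    """Stand-in relevance score: word overlap. A real long retriever
    would use dense embeddings over the long units."""
    q, u = set(query.lower().split()), set(unit.lower().split())
    return len(q & u)


def retrieve(query, units, k=4):
    """Long Retriever: return the top-k long units for the query."""
    return sorted(units, key=lambda u: score(query, u), reverse=True)[:k]


def build_reader_prompt(query, retrieved):
    """Long Reader input: concatenate retrieved units ahead of the question,
    to be fed to a long-context LLM."""
    context = "\n\n".join(retrieved)
    return f"{context}\n\nQuestion: {query}\nAnswer:"
```

Because each retrieval unit carries far more context, the retriever only needs coarse ranking; the long-context reader does the fine-grained extraction, which is the workload rebalancing the paper argues for.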
**Experiments:**
- **Datasets:** Natural Questions (NQ) and HotpotQA.
- **Results:** LongRAG achieves high answer recall scores and EM rates on NQ and HotpotQA, comparable to fully-trained RAG models.
**Ablation Studies:**
- **Retrieval Unit Selection:** Different retrieval unit sizes and types impact performance.
- **Recall vs. EM:** Higher recall does not always lead to better end performance.
- **Reader Model:** GPT-4o performs best among the tested readers.
**Conclusion:**
LongRAG significantly improves RAG performance by balancing the workload between the retriever and reader, leveraging long-context LLMs. Future work should focus on improving long embedding models and generalizing the grouping methods.