LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

30 Jun 2024 | Ziyan Jiang, Xueguang Ma, Wenhu Chen
**Abstract:** Traditional RAG frameworks often use short retrieval units, which leads to an imbalanced design: the retriever carries a heavy workload while the reader has a comparatively light task. To address this imbalance, we propose LongRAG, a framework that processes the entire Wikipedia into roughly 4K-token retrieval units, significantly reducing the corpus size and improving retrieval performance. By feeding the top-k retrieved units to a long-context LLM, LongRAG achieves strong performance on open-domain question-answering tasks such as NQ and HotpotQA. Our study offers insights into combining RAG with long-context LLMs.

**Introduction:** Retrieval-Augmented Generation (RAG) methods enhance large language models (LLMs) by leveraging external corpora. Traditional RAG frameworks use short retrieval units, which can cause semantic incompleteness and place a heavy burden on the retriever. LongRAG addresses this by using long retrieval units (about 30x longer than typical) together with a long retriever and a long reader. The long retriever identifies coarse relevant information from the corpus, while the long reader extracts answers from the retrieved units using a long-context LLM.

**Key Contributions:**
1. **Long Retrieval Unit:** Constructs long retrieval units from whole Wikipedia documents or groups of related documents.
2. **Long Retriever:** Identifies coarse relevant information from the corpus.
3. **Long Reader:** Extracts answers from the concatenated retrieved units using a long-context LLM.

**Experiments:**
- **Datasets:** Natural Questions (NQ) and HotpotQA.
- **Results:** LongRAG achieves high answer recall and EM scores on NQ and HotpotQA, comparable to fully-trained RAG models.

**Ablation Studies:**
- **Retrieval Unit Selection:** Different retrieval unit sizes and types impact performance.
- **Recall vs. EM:** Higher recall does not always lead to better end performance.
- **Reader Model:** GPT-4o performs best among the tested readers.

**Conclusion:** LongRAG significantly improves RAG performance by balancing the workload between the retriever and the reader and by leveraging long-context LLMs. Future work should focus on improving long embedding models and generalizing the grouping methods.
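The long-retrieval-unit idea above can be sketched in code. This is a minimal illustration, not the paper's implementation: it greedily packs documents into units of at most 4K tokens, approximating token counts by whitespace splitting (the paper groups documents using Wikipedia structure, e.g. related articles, and a real tokenizer; `build_units` and `MAX_UNIT_TOKENS` are hypothetical names).

```python
# Sketch of LongRAG-style long retrieval units: greedily pack documents
# into units of at most ~4K tokens. Token counts are approximated by
# whitespace splitting; the paper's grouping heuristic may differ.

MAX_UNIT_TOKENS = 4096


def num_tokens(text: str) -> int:
    # Crude whitespace approximation of a tokenizer.
    return len(text.split())


def build_units(docs: list[dict]) -> list[str]:
    """Greedily pack documents into long retrieval units.

    Each doc is {"title": str, "text": str}. Documents are appended to
    the current unit until the next one would exceed the token budget.
    """
    units, current, current_len = [], [], 0
    for doc in docs:
        doc_len = num_tokens(doc["text"])
        if current and current_len + doc_len > MAX_UNIT_TOKENS:
            units.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(doc["text"])
        current_len += doc_len
    if current:
        units.append("\n\n".join(current))
    return units


docs = [
    {"title": "A", "text": "alpha " * 2000},  # ~2000 tokens
    {"title": "B", "text": "beta " * 2000},   # ~2000 tokens
    {"title": "C", "text": "gamma " * 1000},  # ~1000 tokens
]
units = build_units(docs)
print(len(units))  # prints 2: A+B fit one 4K unit, C starts a new one
```

The greedy packing keeps related documents contiguous while bounding each unit's length, which is what allows the corpus to shrink to far fewer retrieval units than a passage-level index.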
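The retrieve-then-read flow can likewise be sketched. In this toy version a bag-of-words cosine stands in for the dense long-embedding retriever, and the reader step only assembles the prompt that would be sent to a long-context LLM; `retrieve` and `build_reader_prompt` are hypothetical helper names, not the paper's API.

```python
# Toy sketch of the LongRAG retrieve-then-read pipeline. A real system
# would use a dense embedding model for retrieval and call a long-context
# LLM (e.g. GPT-4o per the paper's ablations) with the assembled prompt.
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question: str, units: list[str], k: int = 4) -> list[str]:
    # Rank long retrieval units by similarity to the question; keep top-k.
    q = Counter(question.lower().split())
    ranked = sorted(units,
                    key=lambda u: cosine(q, Counter(u.lower().split())),
                    reverse=True)
    return ranked[:k]


def build_reader_prompt(question: str, retrieved: list[str]) -> str:
    # Concatenate the retrieved long units; the long-context LLM (not
    # called here) would extract the final answer from this prompt.
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


units = ["Paris is the capital of France.",
         "The Nile is a river in Africa."]
question = "What is the capital of France?"
prompt = build_reader_prompt(question, retrieve(question, units, k=1))
print("Paris" in prompt)  # prints True
```

Because each unit is long, a small k already covers most answer-bearing context, which is what shifts work from the retriever to the long-context reader.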