arXiv:2407.08223v1, 11 Jul 2024 | Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister
Speculative RAG is a framework that enhances retrieval augmented generation (RAG) by leveraging a larger generalist language model (LM) to efficiently verify multiple RAG drafts generated in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of the retrieved documents, offering diverse perspectives on the evidence while reducing the input token count per draft. This design improves comprehension of each subset and mitigates potential position bias over long contexts. The method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that Speculative RAG achieves state-of-the-art performance with reduced latency on the TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks, improving accuracy by up to 12.97% while reducing latency by 51% compared with conventional RAG systems on PubHealth.

The framework employs the smaller specialist RAG drafter to generate high-quality draft answers, each derived from a distinct subset of the retrieved documents. The generalist LM, operating over the drafter's outputs, requires no additional tuning: it simply verifies the drafts and integrates the most promising one into the final answer. Because each draft is grounded in a smaller, focused subset, the approach also mitigates the lost-in-the-middle phenomenon that affects long-context inputs. Experiments on four free-form question-answering and closed-set generation benchmarks demonstrate the effectiveness and efficiency of the method.
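The drafting stage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names (`sample_document_subsets`, `draft_answer`, `generate_drafts`) are hypothetical, `draft_answer` is a stand-in for a call to the specialist RAG drafter, and the round-robin split is a simple stand-in for the paper's subset sampling — the key points it shows are that each draft sees a distinct, smaller document subset and that drafts are independent, so they can run in parallel.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_document_subsets(docs, num_drafts, subset_size, seed=0):
    """Split retrieved documents into distinct subsets, one per draft.

    Hypothetical helper: a shuffled round-robin split keeps the subsets
    disjoint, so each draft is grounded in different evidence.
    """
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_drafts][:subset_size] for i in range(num_drafts)]


def draft_answer(question, subset):
    """Stand-in for the specialist RAG drafter, which would return a
    (draft, rationale) pair conditioned on the question and its subset."""
    return (f"draft from {len(subset)} docs", "rationale")


def generate_drafts(question, docs, num_drafts=3, subset_size=2):
    """Produce num_drafts candidate answers, one per document subset."""
    subsets = sample_document_subsets(docs, num_drafts, subset_size)
    # Drafts are independent of one another, so they can be generated
    # concurrently; each call only sees its own small subset.
    with ThreadPoolExecutor(max_workers=num_drafts) as pool:
        return list(pool.map(lambda s: draft_answer(question, s), subsets))
```

Because every draft consumes only `subset_size` documents instead of the full retrieval, the per-draft input token count stays small even when many documents are retrieved.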
The specialist RAG drafter is instruction-tuned to generate both an answer draft and an accompanying rationale, which helps it better understand the contextual documents. The generalist LM can be any off-the-shelf pre-trained LM. During verification, it considers only the draft-rationale pairs and skips the lengthy, redundant retrieval results, relying on its language modeling ability to rank and select among the candidates. The evaluation combines a self-consistency score and a self-reflection score, both computed from the generalist LM's language modeling ability. Results show that Speculative RAG consistently outperforms all baselines across the four benchmarks while achieving the lowest latency on every dataset, demonstrating reduced processing time without sacrificing quality.
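The verification step can be sketched as follows. This is a hedged approximation, assuming the scores are derived from sequence probabilities under the generalist LM: `sequence_logprob` is a stand-in for one forward pass that sums token log-probabilities (here replaced by a toy length heuristic so the sketch runs), and the reflection prompt wording is illustrative, not the paper's exact prompt. What it shows is the structure: each draft-rationale pair gets a self-consistency score and a self-reflection score, and the pair with the highest combined score is selected.

```python
import math


def sequence_logprob(prompt, continuation):
    """Stand-in for the generalist LM's log-probability of `continuation`
    given `prompt`. A real system would sum per-token log-probs from a
    single forward pass; this toy heuristic just favors shorter text."""
    return -0.1 * len(continuation)


def score_draft(question, draft, rationale):
    """Combine self-consistency and self-reflection scores for one pair."""
    # Self-consistency: how plausible the LM finds the draft and rationale
    # as a continuation of the question.
    rho_sc = math.exp(sequence_logprob(question, draft + rationale))
    # Self-reflection: how strongly the LM affirms a reflection statement
    # about the pair (illustrative wording, not the paper's prompt).
    reflection = "Do you think the rationale supports the answer? Yes"
    rho_sr = math.exp(sequence_logprob(question + draft + rationale, reflection))
    return rho_sc * rho_sr


def select_best_draft(question, candidates):
    """Single verification pass: rank (draft, rationale) pairs and keep
    the highest-scoring draft as the final answer."""
    return max(candidates, key=lambda pair: score_draft(question, *pair))
```

Note that only the draft and rationale enter the verification prompts; the retrieved documents themselves are not re-read by the generalist LM, which is what keeps the verification pass cheap.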