arXiv:2407.08223v1, 11 Jul 2024 | Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister
Speculative RAG is a framework that enhances retrieval augmented generation (RAG) by leveraging a larger generalist language model (LM) to efficiently verify multiple RAG drafts generated in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of the retrieved documents, offering diverse perspectives on the evidence while reducing the input token count per draft. This design improves comprehension of each subset and mitigates potential position bias over long contexts. The method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that Speculative RAG achieves state-of-the-art performance with reduced latency on the TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks, improving accuracy by up to 12.97% while reducing latency by 51% compared with conventional RAG systems on PubHealth.

The framework employs the smaller specialist RAG drafter to generate high-quality draft answers, each derived from a distinct subset of the retrieved documents. The generalist LM, operating over the drafter's outputs, requires no additional tuning: it simply verifies the drafts and integrates the most promising one into the final answer. Because each draft is grounded in a smaller, focused subset, the approach also mitigates the lost-in-the-middle phenomenon that affects long-context inputs. Experiments on four free-form question-answering and closed-set generation benchmarks demonstrate the effectiveness and efficiency of the method.
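The drafting stage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names (`sample_document_subsets`, `draft_answer`, `generate_drafts`) are hypothetical, `draft_answer` is a stand-in for a call to the specialist RAG drafter, and the round-robin split is a simple stand-in for the paper's subset sampling — the key points it shows are that each draft sees a distinct, smaller document subset and that drafts are independent, so they can run in parallel.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_document_subsets(docs, num_drafts, subset_size, seed=0):
    """Split retrieved documents into distinct subsets, one per draft.

    Hypothetical helper: a shuffled round-robin split keeps the subsets
    disjoint, so each draft is grounded in different evidence.
    """
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_drafts][:subset_size] for i in range(num_drafts)]


def draft_answer(question, subset):
    """Stand-in for the specialist RAG drafter, which would return a
    (draft, rationale) pair conditioned on the question and its subset."""
    return (f"draft from {len(subset)} docs", "rationale")


def generate_drafts(question, docs, num_drafts=3, subset_size=2):
    """Produce num_drafts candidate answers, one per document subset."""
    subsets = sample_document_subsets(docs, num_drafts, subset_size)
    # Drafts are independent of one another, so they can be generated
    # concurrently; each call only sees its own small subset.
    with ThreadPoolExecutor(max_workers=num_drafts) as pool:
        return list(pool.map(lambda s: draft_answer(question, s), subsets))
```

Because every draft consumes only `subset_size` documents instead of the full retrieval, the per-draft input token count stays small even when many documents are retrieved.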
The specialist RAG drafter is instruction-tuned to generate both an answer draft and an accompanying rationale, which helps it better understand the contextual documents. The generalist LM can be any off-the-shelf pre-trained LM. During verification, it considers only the draft-rationale pairs and skips the lengthy, redundant retrieval results, relying on its language modeling ability to rank and select among the candidates. The evaluation combines a self-consistency score and a self-reflection score, both computed from the generalist LM's language modeling ability. Results show that Speculative RAG consistently outperforms all baselines across the four benchmarks while achieving the lowest latency on every dataset, demonstrating reduced processing time without sacrificing quality.
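The verification step can be sketched as follows. This is a hedged approximation, assuming the scores are derived from sequence probabilities under the generalist LM: `sequence_logprob` is a stand-in for one forward pass that sums token log-probabilities (here replaced by a toy length heuristic so the sketch runs), and the reflection prompt wording is illustrative, not the paper's exact prompt. What it shows is the structure: each draft-rationale pair gets a self-consistency score and a self-reflection score, and the pair with the highest combined score is selected.

```python
import math


def sequence_logprob(prompt, continuation):
    """Stand-in for the generalist LM's log-probability of `continuation`
    given `prompt`. A real system would sum per-token log-probs from a
    single forward pass; this toy heuristic just favors shorter text."""
    return -0.1 * len(continuation)


def score_draft(question, draft, rationale):
    """Combine self-consistency and self-reflection scores for one pair."""
    # Self-consistency: how plausible the LM finds the draft and rationale
    # as a continuation of the question.
    rho_sc = math.exp(sequence_logprob(question, draft + rationale))
    # Self-reflection: how strongly the LM affirms a reflection statement
    # about the pair (illustrative wording, not the paper's prompt).
    reflection = "Do you think the rationale supports the answer? Yes"
    rho_sr = math.exp(sequence_logprob(question + draft + rationale, reflection))
    return rho_sc * rho_sr


def select_best_draft(question, candidates):
    """Single verification pass: rank (draft, rationale) pairs and keep
    the highest-scoring draft as the final answer."""
    return max(candidates, key=lambda pair: score_draft(question, *pair))
```

Note that only the draft and rationale enter the verification prompts; the retrieved documents themselves are not re-read by the generalist LM, which is what keeps the verification pass cheap.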