ACCELERATING RETRIEVAL-AUGMENTED LANGUAGE MODEL SERVING WITH SPECULATION

25 Jan 2024 | Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia
**Retrieval-augmented language models (RaLMs)** have shown promise in solving knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Compared to fully parametric models, RaLMs offer low-cost adaptation to new data and better source attribution. Among RaLM approaches, iterative RaLMs deliver better generation quality because of frequent interactions between the retriever and the language model, but these interactions also incur high retrieval overhead.

To address this, the authors propose **RaLMSpec**, a framework that reduces iterative RaLM serving latency while preserving model outputs through speculative retrieval and batched verification. RaLMSpec exploits the temporal and spatial locality of retrieved documents: a local cache of past documents serves speculative retrievals, and the true retriever later verifies them in batches. Three additional techniques, cache prefetching, an optimal speculation stride scheduler, and asynchronous verification, further improve performance. Evaluations on four downstream QA datasets and on KNN-LM serving show speed-ups of 1.04× to 7.59× over naive iterative RaLM serving and baseline implementations. In short, RaLMSpec contributes a speculation-based serving framework for iterative RaLMs together with a set of complementary techniques that further reduce latency.
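The core idea, speculative retrieval from a local cache with batched verification by the true retriever, can be illustrated with a small sketch. The code below is a hypothetical simplification, not the authors' implementation: `SpeculationCache`, `serve`, `lm_step`, and `true_retrieve` are invented names, the cache simply returns its most recently used entry, and the speculation stride is fixed rather than adaptively scheduled as in RaLMSpec.

```python
from collections import OrderedDict

class SpeculationCache:
    """Local cache of recently retrieved documents (exploits temporal locality)."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.docs = OrderedDict()               # doc_id -> document text

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.docs.move_to_end(doc_id)
        if len(self.docs) > self.capacity:
            self.docs.popitem(last=False)       # evict the least recently used doc

    def speculate(self, query):
        """Cheap guess: return the most recently cached document, if any."""
        if not self.docs:
            return None, None
        return next(reversed(self.docs.items()))

def serve(lm_step, true_retrieve, prompt, stride=4, max_steps=32):
    """Generate with speculative retrieval; verify every `stride` steps in one batch."""
    cache, output, pending = SpeculationCache(), prompt, []

    for _ in range(max_steps):
        query = output                          # retrieval query is the current context
        pos = len(output)                       # position before this generation step
        doc_id, doc = cache.speculate(query)
        output = lm_step(output, doc)           # decode with the speculated document
        pending.append((pos, query, doc_id))

        if len(pending) == stride:              # batched verification by the true retriever
            true_ids = true_retrieve([q for _, q, _ in pending])
            for (p, _, spec_id), true_id in zip(pending, true_ids):
                cache.add(true_id, f"doc-{true_id}")   # toy document text
                if spec_id != true_id:
                    # Speculation failed: roll back to the mismatch and regenerate
                    # with the verified document; later speculated steps are dropped.
                    output = lm_step(output[:p], f"doc-{true_id}")
                    break
            pending.clear()
    return output

# Toy stand-ins: the LM appends one character; the retriever maps each query to an id.
if __name__ == "__main__":
    lm = lambda ctx, doc: ctx + "x"
    retriever = lambda queries: [len(q) % 3 for q in queries]
    print(serve(lm, retriever, "what is speculative retrieval? "))
```

Batching the verification requests amortizes the cost of the expensive retriever over several generation steps. RaLMSpec goes further than this sketch: it prefetches additional retrieval results into the cache, adapts the stride with its speculation stride scheduler, and overlaps verification with generation asynchronously.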