Accelerating Retrieval-Augmented Language Model Serving with Speculation

25 Jan 2024 | Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia
RaLMSpec is a framework that accelerates retrieval-augmented language model (RaLM) serving by reducing the overhead of iterative retrieval while preserving model output quality. It employs speculative retrieval with batched verification to improve efficiency. RaLMSpec leverages the temporal and spatial locality of retrieved documents by maintaining a local cache for speculative retrieval. It also incorporates cache prefetching, an optimal speculation stride scheduler, and asynchronous verification to further reduce serving latency. RaLMSpec achieves significant speed-ups over baseline methods: up to 7.59× for exact dense retrievers and 2.45× for approximate dense retrievers on KNN-LM serving. For naive iterative RaLM serving, RaLMSpec achieves speed-up ratios of 1.75–2.39×, 1.04–1.39×, and 1.31–1.77× for exact dense, approximate dense, and sparse retrievers, respectively. The framework is effective across different language models, datasets, and retriever types, demonstrating its versatility as a generic acceleration framework for iterative RaLM serving.
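The core idea of speculative retrieval with batched verification can be illustrated with a minimal sketch. This is a hypothetical simplification, not RaLMSpec's actual implementation: the real system speculates during autoregressive decoding and rolls back the model state on a mispredicted retrieval, whereas here we only model the cache-guess-then-batch-verify loop over a stream of queries, with a fixed speculation stride and an LRU cache exploiting temporal locality.

```python
from collections import OrderedDict


class SpeculativeRetriever:
    """Toy sketch of cache-based speculative retrieval (not the real RaLMSpec).

    For each stride of requests, answers are first guessed from a small
    local LRU cache (exploiting temporal/spatial locality of retrieved
    documents), then verified with a single batched call to the true
    retriever. Correct guesses would let serving proceed without waiting
    on the retriever; incorrect ones are replaced by the verified result.
    """

    def __init__(self, retriever, stride=4, cache_size=64):
        self.retriever = retriever      # callable: query -> document
        self.stride = stride            # speculation stride (fixed here;
                                        # RaLMSpec schedules it adaptively)
        self.cache_size = cache_size
        self.cache = OrderedDict()      # LRU cache: query -> document

    def _guess(self, query):
        """Speculative answer from the local cache, or None on a miss."""
        if query in self.cache:
            self.cache.move_to_end(query)
            return self.cache[query]
        return None

    def _store(self, query, doc):
        """Insert a verified result, evicting the least-recently-used entry."""
        self.cache[query] = doc
        self.cache.move_to_end(query)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def retrieve_stream(self, queries):
        """Serve queries stride by stride; returns (documents, correct_guesses)."""
        results, hits = [], 0
        for start in range(0, len(queries), self.stride):
            batch = queries[start:start + self.stride]
            guesses = [self._guess(q) for q in batch]
            # One batched verification call replaces `stride` sequential
            # retrievals; this amortization is the source of the speed-up.
            truth = [self.retriever(q) for q in batch]
            for query, guess, doc in zip(batch, guesses, truth):
                if guess == doc:
                    hits += 1           # speculation succeeded for this step
                results.append(doc)     # the verified document always wins
                self._store(query, doc)
        return results, hits
```

On a repetitive query stream, repeated queries hit the cache and their retrievals are verified in batch rather than issued one at a time, mirroring (in highly simplified form) how locality makes speculation profitable.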