CACHEBLEND: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

3 Jun 2024 | Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang
CACHEBLEND is a system for fast large language model (LLM) serving in retrieval-augmented generation (RAG) through cached knowledge fusion. It addresses the challenge of combining precomputed key-value (KV) caches from multiple text chunks in an LLM input while matching the generation quality of full prefill (i.e., recomputing the KV cache without any reuse). CACHEBLEND selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache, and it chooses the tokens with the highest KV deviation so that the resulting attention deviation, and thus the loss in generation quality, is minimized. Because only a small fraction of tokens is recomputed, the recomputation can be pipelined with the retrieval of KV caches, which lets CACHEBLEND store KV caches on slower non-volatile devices without increasing inference delay. Compared with state-of-the-art KV cache reusing schemes on three open-source LLMs and four benchmark datasets, CACHEBLEND reduces time-to-first-token (TTFT) by 2.2–3.3× and increases inference throughput by 2.8–5× relative to full KV recompute, without compromising generation quality or incurring additional storage cost. CACHEBLEND is implemented on top of vLLM and shows consistent TTFT and throughput improvements across multiple models and datasets.
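To make the token-selection idea concrete, below is a minimal sketch (not the actual CACHEBLEND or vLLM code) of how one might pick the tokens to recompute at a layer: compare the reused (precomputed) KV values against freshly recomputed KV values and keep the tokens whose deviation is largest. The function name, tensor shapes, and the 15% recompute ratio are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of CACHEBLEND's "recompute the high-deviation tokens" idea.
# All names and shapes here are assumptions for illustration only.
import torch

def select_tokens_to_recompute(cached_kv: torch.Tensor,
                               fresh_kv: torch.Tensor,
                               recompute_ratio: float = 0.15) -> torch.Tensor:
    """Return indices of the tokens with the highest KV deviation.

    cached_kv, fresh_kv: [num_tokens, hidden_dim] KV values at one layer,
    taken from the reused cache and from recomputation, respectively.
    recompute_ratio: fraction of tokens selected for recomputation.
    """
    # Per-token L2 deviation between reused and recomputed KV values.
    deviation = torch.linalg.vector_norm(fresh_kv - cached_kv, dim=-1)
    num_selected = max(1, int(recompute_ratio * cached_kv.shape[0]))
    # Tokens with the largest KV deviation contribute most to attention error,
    # so they are the ones worth recomputing; the rest keep their cached KV.
    return torch.topk(deviation, num_selected).indices

# Toy usage: 1,024 tokens, hidden size 128, recompute ~15% of the tokens.
cached = torch.randn(1024, 128)
fresh = cached + 0.01 * torch.randn(1024, 128)   # stand-in for recomputed KV
print(select_tokens_to_recompute(cached, fresh).shape)  # torch.Size([153])
```

In the system described above, this selection would run per layer, and the recomputation of the selected tokens would be overlapped with fetching the remaining KV cache from slower storage.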