CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion


3 Jun 2024 | Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang
CacheBlend is a system that accelerates large language model (LLM) serving for retrieval-augmented generation (RAG) by reusing precomputed key-value (KV) caches. The core challenge is combining multiple precomputed KV caches, one per retrieved text chunk, into a single LLM input without degrading generation quality. CacheBlend addresses this by selectively recomputing the KV values of a small subset of tokens, which speeds up inference while preserving output quality.

Existing approaches fall short in different ways. Prefix caching reuses only the KV cache of the input prefix, which helps little when an input concatenates multiple retrieved chunks that rarely share a common prefix. Full KV reuse is faster but ignores the cross-attention between chunks, which degrades generation quality. CacheBlend instead recomputes the KV values of the tokens whose cached entries deviate most from a full KV recompute, so the resulting blended KV cache closely matches what a full prefill would produce. This selective recomputation lets CacheBlend significantly reduce time-to-first-token (TTFT) and increase inference throughput without sacrificing generation quality.
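To make the selective-recomputation idea concrete, here is a minimal, self-contained sketch; it is not the authors' code, and the tensor shapes, the per-token L2 deviation metric, and the "full prefill" stand-in are illustrative assumptions. In the real system the deviation observed at one layer guides which tokens to recompute at the next layer, so a full prefill is never actually run; the toy example below computes the deviation directly only to stay runnable.

```python
import torch

def select_high_deviation_tokens(kv_reused, kv_full, ratio=0.15):
    # Per-token L2 deviation between the reused KV and the fully recomputed KV.
    deviation = (kv_reused - kv_full).pow(2).sum(dim=-1)
    k = max(1, int(ratio * kv_reused.shape[0]))
    return torch.topk(deviation, k).indices

def blend_kv(kv_reused, kv_full, idx):
    # Keep the cheap reused KV everywhere except the selected tokens,
    # which take their recomputed values.
    blended = kv_reused.clone()
    blended[idx] = kv_full[idx]
    return blended

# Toy usage: 1,000 tokens, 64-dim values for one attention head.
kv_reused = torch.randn(1000, 64)                   # concatenated per-chunk caches
kv_full = kv_reused + 0.05 * torch.randn(1000, 64)  # stand-in for a full prefill
idx = select_high_deviation_tokens(kv_reused, kv_full)
kv_blended = blend_kv(kv_reused, kv_full, idx)
print(f"recomputed {idx.numel()} of {kv_reused.shape[0]} tokens")
```

The key design point the sketch illustrates is that only a small, fixed fraction of tokens (here 15%) pay the recomputation cost, while the rest reuse their cached KV entries unchanged.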
CacheBlend's design builds on the observation that attention matrices are sparse: only a small fraction of tokens carry significant attention values, so recomputing the KV of the highest-deviation tokens recovers most of the cross-attention information needed for accurate generation. The system also pipelines the loading of KV caches with the selective recomputation, hiding the extra recompute delay so that KV caches can be stored on slower, cheaper devices without increasing inference latency. An evaluation on three open-source LLMs and four benchmark datasets shows that CacheBlend reduces TTFT by 2.2–3.3× and increases inference throughput by 2.8–5× compared with full KV recompute, while maintaining generation quality, and it outperforms prefix caching and full KV reuse on both speed and accuracy. Its ability to manage KV caches across different storage tiers makes it a promising option for accelerating LLM inference in RAG scenarios.
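The following sketch shows one way the loading/compute overlap could be structured; it is an assumption for illustration, not the paper's implementation, and `load_kv_from_storage` and `recompute_selected_tokens` are hypothetical stand-ins for storage I/O and partial prefill on the GPU.

```python
import queue
import threading
import time

def load_kv_from_storage(layer):
    # Hypothetical loader: stands in for reading one layer's KV cache from SSD/disk.
    time.sleep(0.01)
    return f"kv_layer_{layer}"

def recompute_selected_tokens(layer, kv):
    # Hypothetical partial prefill: stands in for recomputing the selected tokens' KV.
    time.sleep(0.01)
    return f"blended_{kv}"

def pipelined_prefill(num_layers):
    prefetch = queue.Queue(maxsize=1)

    def loader():
        # Runs one layer ahead of the compute loop, so layer i+1 is being
        # fetched from storage while layer i's selected tokens are recomputed.
        for layer in range(num_layers):
            prefetch.put(load_kv_from_storage(layer))

    threading.Thread(target=loader, daemon=True).start()

    blended = []
    for layer in range(num_layers):
        kv = prefetch.get()  # usually already loaded by the background thread
        blended.append(recompute_selected_tokens(layer, kv))
    return blended

print(pipelined_prefill(num_layers=4))
```

When the per-layer load time is no longer than the per-layer recompute time, the I/O cost is fully hidden behind the compute, which is why slower storage does not add to end-to-end latency.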