RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation


25 Apr 2024 | Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
**Affiliations:** Peking University, ByteDance Inc.

**Abstract:** Retrieval-Augmented Generation (RAG) enhances natural language processing tasks by integrating large language models (LLMs) with external knowledge databases. However, RAG introduces long sequence generation, which incurs high computational and memory costs. RAGCache is a novel multilevel dynamic caching system designed for RAG: it organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in GPU and host memory. RAGCache proposes a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy and a dynamic speculative pipelining strategy to minimize end-to-end latency. Implemented on vLLM and Faiss, RAGCache reduces time to first token (TTFT) by up to 4× and improves throughput by up to 2.1× compared with vLLM integrated with Faiss.

**Contributions:**
- Conducted a detailed system characterization of RAG, identifying performance bottlenecks and optimization opportunities.
- Proposed RAGCache, the first RAG system to cache and share the intermediate states of external knowledge across multiple queries.
- Designed a prefix-aware GDSF (PGDSF) replacement policy and a dynamic speculative pipelining approach to minimize latency and improve efficiency (a GDSF-style priority sketch follows this list).
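The prefix-aware policy builds on the classic Greedy-Dual-Size-Frequency heuristic, which ranks cache entries by access frequency, recomputation cost, and size. This summary does not give the exact PGDSF formula, so the sketch below is only a minimal GDSF-style priority computation with illustrative field names (`recompute_cost`, `size_tokens`) and an aging `clock` term; the prefix-ordering constraints described in the paper are not modeled.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    doc_id: str
    size_tokens: int       # size of the cached KV tensors, in tokens
    frequency: int         # how often this entry has been reused
    recompute_cost: float  # estimated prefill latency to rebuild the KV tensors

def gdsf_priority(entry: CacheEntry, clock: float) -> float:
    """GDSF-style priority: entries that are small, expensive to recompute,
    and frequently reused are evicted last. The clock term ages out entries
    that have not been touched recently."""
    return clock + entry.frequency * entry.recompute_cost / entry.size_tokens

def evict_one(entries: list[CacheEntry], clock: float) -> CacheEntry:
    """Evict the entry with the lowest priority, as in classic GDSF."""
    victim = min(entries, key=lambda e: gdsf_priority(e, clock))
    entries.remove(victim)
    return victim
```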
**RAGCache Overview:**
- Caches the key-value tensors of retrieved documents and shares them across multiple requests.
- Organizes the cached tensors in a knowledge tree for efficient prefix matching (see the knowledge-tree sketch after the Evaluation section).
- Employs a global RAG controller to manage interactions between the external knowledge database and the LLM inference engine.
- Includes cache-aware reordering to improve the cache hit rate, and dynamic speculative pipelining to overlap retrieval and inference (a pipelining sketch appears at the end of this summary).

**Evaluation:**
- Compared RAGCache against vLLM and SGLang on various datasets and models.
- Reduced TTFT by up to 4× and improved throughput by up to 2.1×.
- Demonstrated scalability to larger models and different attention mechanisms.
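The knowledge tree can be read as a trie keyed by document IDs in retrieval order: a path from the root corresponds to an ordered document sequence whose key-value tensors are cached, and a new request reuses the longest cached prefix of its retrieved documents. The sketch below is a minimal illustration under that reading; `kv_handle`, `insert`, and `longest_prefix` are assumed names, not the paper's actual API.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One node per retrieved document; a root-to-node path encodes an
    ordered document sequence whose KV tensors are cached."""
    doc_id: str | None = None
    kv_handle: object | None = None  # opaque handle to cached KV tensors (GPU or host)
    children: dict[str, KnowledgeNode] = field(default_factory=dict)

class KnowledgeTree:
    def __init__(self) -> None:
        self.root = KnowledgeNode()

    def insert(self, doc_ids: list[str], kv_handles: list[object]) -> None:
        """Cache KV tensors for an ordered sequence of retrieved documents."""
        node = self.root
        for doc_id, handle in zip(doc_ids, kv_handles):
            node = node.children.setdefault(doc_id, KnowledgeNode(doc_id))
            node.kv_handle = handle

    def longest_prefix(self, doc_ids: list[str]) -> list[object]:
        """Return KV handles for the longest cached prefix of the request's
        retrieved documents; the remainder must be prefilled from scratch."""
        node, handles = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            handles.append(child.kv_handle)
            node = child
        return handles
```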
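Dynamic speculative pipelining overlaps the vector-search step with LLM prefill instead of running them strictly back to back. Below is a minimal asyncio sketch of that idea, assuming the retriever can expose intermediate candidates before its final result; `retriever.intermediate`, `retriever.final`, `llm.prefill`, and `llm.decode` are hypothetical interfaces used only for illustration, not the paper's implementation.

```python
import asyncio

async def speculative_pipeline(query: str, retriever, llm) -> str:
    """Overlap retrieval and prefill: start a speculative prefill on the
    retriever's intermediate candidates while the full search finishes.
    If the final candidates match, the speculative work is kept; otherwise
    it is discarded and prefill restarts with the final documents."""
    intermediate_task = asyncio.create_task(retriever.intermediate(query))
    final_task = asyncio.create_task(retriever.final(query))

    guess_docs = await intermediate_task
    speculative_prefill = asyncio.create_task(llm.prefill(query, guess_docs))

    final_docs = await final_task
    if final_docs == guess_docs:
        kv_state = await speculative_prefill          # speculation paid off
    else:
        speculative_prefill.cancel()                  # mis-speculation: redo prefill
        kv_state = await llm.prefill(query, final_docs)
    return await llm.decode(kv_state)
```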