RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation


25 Apr 2024 | Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
**Affiliations:** Peking University, ByteDance Inc.

**Abstract:** Retrieval-Augmented Generation (RAG) enhances natural language processing tasks by integrating large language models (LLMs) with external knowledge databases. However, RAG introduces long sequence generation, which incurs high computational and memory costs. RAGCache is a novel multilevel dynamic caching system designed for RAG: it organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in GPU and host memory. RAGCache proposes a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy and a dynamic speculative pipelining strategy to minimize end-to-end latency. Implemented on vLLM and Faiss, RAGCache reduces time to first token (TTFT) by up to 4× and improves throughput by up to 2.1× compared with vLLM integrated with Faiss.

**Contributions:**
- Conducted a detailed system characterization of RAG, identifying performance bottlenecks and optimization opportunities.
- Proposed RAGCache, the first RAG system to cache and share the intermediate states of external knowledge across multiple queries.
- Designed a prefix-aware GDSF (PGDSF) replacement policy and a dynamic speculative pipelining approach to minimize latency and improve efficiency (a GDSF-style priority sketch follows this list).
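The prefix-aware policy builds on the classic Greedy-Dual-Size-Frequency heuristic, which ranks cache entries by access frequency, recomputation cost, and size. This summary does not give the exact PGDSF formula, so the sketch below is only a minimal GDSF-style priority computation with illustrative field names (`recompute_cost`, `size_tokens`) and an aging `clock` term; the prefix-ordering constraints described in the paper are not modeled.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    doc_id: str
    size_tokens: int       # size of the cached KV tensors, in tokens
    frequency: int         # how often this entry has been reused
    recompute_cost: float  # estimated prefill latency to rebuild the KV tensors

def gdsf_priority(entry: CacheEntry, clock: float) -> float:
    """GDSF-style priority: entries that are small, expensive to recompute,
    and frequently reused are evicted last. The clock term ages out entries
    that have not been touched recently."""
    return clock + entry.frequency * entry.recompute_cost / entry.size_tokens

def evict_one(entries: list[CacheEntry], clock: float) -> CacheEntry:
    """Evict the entry with the lowest priority, as in classic GDSF."""
    victim = min(entries, key=lambda e: gdsf_priority(e, clock))
    entries.remove(victim)
    return victim
```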
**RAGCache Overview:**
- Caches the key-value tensors of retrieved documents and shares them across multiple requests.
- Organizes the cached tensors in a knowledge tree for efficient prefix matching (see the knowledge-tree sketch after the Evaluation section).
- Employs a global RAG controller to manage interactions between the external knowledge database and the LLM inference engine.
- Includes cache-aware reordering to improve the cache hit rate, and dynamic speculative pipelining to overlap retrieval and inference (a pipelining sketch appears at the end of this summary).

**Evaluation:**
- Compared RAGCache against vLLM and SGLang on various datasets and models.
- Reduced TTFT by up to 4× and improved throughput by up to 2.1×.
- Demonstrated scalability to larger models and different attention mechanisms.
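The knowledge tree can be read as a trie keyed by document IDs in retrieval order: a path from the root corresponds to an ordered document sequence whose key-value tensors are cached, and a new request reuses the longest cached prefix of its retrieved documents. The sketch below is a minimal illustration under that reading; `kv_handle`, `insert`, and `longest_prefix` are assumed names, not the paper's actual API.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One node per retrieved document; a root-to-node path encodes an
    ordered document sequence whose KV tensors are cached."""
    doc_id: str | None = None
    kv_handle: object | None = None  # opaque handle to cached KV tensors (GPU or host)
    children: dict[str, KnowledgeNode] = field(default_factory=dict)

class KnowledgeTree:
    def __init__(self) -> None:
        self.root = KnowledgeNode()

    def insert(self, doc_ids: list[str], kv_handles: list[object]) -> None:
        """Cache KV tensors for an ordered sequence of retrieved documents."""
        node = self.root
        for doc_id, handle in zip(doc_ids, kv_handles):
            node = node.children.setdefault(doc_id, KnowledgeNode(doc_id))
            node.kv_handle = handle

    def longest_prefix(self, doc_ids: list[str]) -> list[object]:
        """Return KV handles for the longest cached prefix of the request's
        retrieved documents; the remainder must be prefilled from scratch."""
        node, handles = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            handles.append(child.kv_handle)
            node = child
        return handles
```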
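Dynamic speculative pipelining overlaps the vector-search step with LLM prefill instead of running them strictly back to back. Below is a minimal asyncio sketch of that idea, assuming the retriever can expose intermediate candidates before its final result; `retriever.intermediate`, `retriever.final`, `llm.prefill`, and `llm.decode` are hypothetical interfaces used only for illustration, not the paper's implementation.

```python
import asyncio

async def speculative_pipeline(query: str, retriever, llm) -> str:
    """Overlap retrieval and prefill: start a speculative prefill on the
    retriever's intermediate candidates while the full search finishes.
    If the final candidates match, the speculative work is kept; otherwise
    it is discarded and prefill restarts with the final documents."""
    intermediate_task = asyncio.create_task(retriever.intermediate(query))
    final_task = asyncio.create_task(retriever.final(query))

    guess_docs = await intermediate_task
    speculative_prefill = asyncio.create_task(llm.prefill(query, guess_docs))

    final_docs = await final_task
    if final_docs == guess_docs:
        kv_state = await speculative_prefill          # speculation paid off
    else:
        speculative_prefill.cancel()                  # mis-speculation: redo prefill
        kv_state = await llm.prefill(query, final_docs)
    return await llm.decode(kv_state)
```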