DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

2024 | Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
DéjàVu is a system designed to improve the efficiency and fault tolerance of distributed large language model (LLM) serving. It addresses three key challenges: pipeline bubbles caused by the latency gap between prompt processing and token generation, inefficient GPU memory usage due to overprovisioning of the key-value (KV) cache, and the lack of fault tolerance when failures occur. DéjàVu implements its solutions on top of DéjàVuLib, a versatile and efficient KV-cache streaming library.

To reduce pipeline bubbles, DéjàVu disaggregates prompt processing from token generation and provisions resources for each phase independently, minimizing pipeline idle time and improving throughput.

To use GPU memory more efficiently, DéjàVu swaps the KV cache between GPU and CPU memory at microbatch granularity, keeping only the active microbatches' caches resident on the GPU and thereby enabling larger batch sizes.
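To make the swapping idea concrete, here is a minimal PyTorch-style sketch of microbatch-level KV-cache swapping. The class and method names (`KVSwapManager`, `offload`, `prefetch`) are illustrative assumptions, not DéjàVu's actual API; the paper implements this machinery inside DéjàVuLib.

```python
import torch

class KVSwapManager:
    """Keeps only the active microbatch's KV cache resident on the GPU.

    Illustrative sketch: per-layer KV tensors are copied to pinned CPU
    buffers on a side CUDA stream so transfers overlap with compute.
    """

    def __init__(self, num_layers, cache_shape):
        self.num_layers = num_layers
        self.cache_shape = cache_shape  # e.g. (2, batch, heads, seq_len, head_dim)
        self.copy_stream = torch.cuda.Stream()  # dedicated stream for swaps
        self.cpu_cache = {}  # mb_id -> list of pinned CPU buffers, one per layer

    def offload(self, mb_id, gpu_cache):
        """Asynchronously copy a microbatch's KV tensors out to pinned CPU memory."""
        bufs = self.cpu_cache.setdefault(
            mb_id,
            [torch.empty(self.cache_shape, pin_memory=True)
             for _ in range(self.num_layers)],
        )
        # Make sure the compute stream has finished writing the KV tensors.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            for cpu_buf, gpu_buf in zip(bufs, gpu_cache):
                cpu_buf.copy_(gpu_buf, non_blocking=True)

    def prefetch(self, mb_id, gpu_cache):
        """Asynchronously copy a swapped-out microbatch's KV tensors back to the GPU."""
        with torch.cuda.stream(self.copy_stream):
            for gpu_buf, cpu_buf in zip(gpu_cache, self.cpu_cache[mb_id]):
                gpu_buf.copy_(cpu_buf, non_blocking=True)

    def wait(self):
        """Block the compute stream until in-flight swaps complete."""
        torch.cuda.current_stream().wait_stream(self.copy_stream)
```

In such a scheme, the serving loop would call `offload()` right after a microbatch's forward step, `prefetch()` for the next scheduled microbatch, and `wait()` before computing on swapped-in tensors, so transfers hide behind computation.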
For fault tolerance, DéjàVu replicates the KV cache in persistent storage or remote CPU memory; after a failure, generation resumes from the replicated cache instead of being recomputed, minimizing redundant work and enabling quick recovery.

DéjàVuLib, the library underpinning these mechanisms, is modular and supports fast KV-cache streaming across configurations, including streaming between local or remote machines and for different KV-cache structures. It includes optimizations such as buffered copies, layer-by-layer prompt-cache streaming, and overlapping token computation with streaming, which hide transfer overheads.
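The layer-by-layer optimization can be sketched in the same style: rather than waiting for the entire prompt forward pass, each layer's KV tensors are handed to a side stream as soon as they are produced. The signatures here are assumptions for illustration, not DéjàVuLib's real interface: `layer(hidden)` returns that layer's KV tensors, and `send_fn` stands in for whatever transfer is performed (a GPU-to-CPU copy, or a network send to a token-generation worker).

```python
import torch

def prompt_forward_with_streaming(layers, hidden, send_fn):
    """Prompt processing that streams each layer's KV cache as it is produced.

    Hypothetical signatures: layer(hidden) -> (hidden, k, v), and
    send_fn(layer_idx, k, v) performs the actual transfer.
    """
    copy_stream = torch.cuda.Stream()
    for i, layer in enumerate(layers):
        hidden, k, v = layer(hidden)       # compute this layer's prompt KV
        ready = torch.cuda.Event()
        ready.record()                     # KV is valid once compute reaches here
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(ready)  # don't read KV before it is written
            send_fn(i, k, v)               # stream out while later layers compute
    torch.cuda.current_stream().wait_stream(copy_stream)
    return hidden
```

Because the transfer for layer i proceeds while layers i+1 onward are still computing, most of the streaming cost is hidden; the same pattern applies when replicating the cache for fault tolerance.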
In the paper's evaluation, DéjàVu improves LLM serving throughput by up to 2× over FasterTransformer. Microbatch swapping allows larger batch sizes, increasing throughput by up to 1.8×, and in the presence of system failures DéjàVu achieves 1.54× lower microbatch latency than non-fault-tolerant systems, recovering seamlessly and limiting the impact of failures on request latency.

Overall, DéjàVu is a comprehensive system that addresses key challenges in LLM serving, offering high throughput, efficient memory usage, and fault tolerance, with DéjàVuLib providing the KV-cache streaming foundation for these improvements.