4 Mar 2024 | Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
**DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving**
**Abstract:**
Distributed large language model (LLM) serving is costly and often underutilizes hardware accelerators due to three key challenges: pipeline-parallel deployments with bimodal latency between prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. This paper proposes DéjàVu, a system that addresses these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). DéjàVuLib enables efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault tolerance. The efficacy of these solutions is demonstrated on a range of large models across cloud deployments.
**Introduction:**
Large language models (LLMs) such as GPT-3, OPT, and BLOOM are widely used for chatbots, code generation, and text summarization. Two trends dominate generative LLM inference: growing model sizes and longer input sequences, both of which drive up memory footprints and force parallelization across multiple GPUs. The Key-Value (KV) cache that stores prior attention computations makes serving stateful, and the resulting bimodal latency between prompt processing and token generation leaves GPUs underutilized. Existing systems also overprovision GPU memory and lack efficient failure-handling mechanisms.
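To make the memory-footprint concern concrete, here is a rough back-of-the-envelope estimate of KV cache size for a decoder-only transformer; the model dimensions below are illustrative (roughly GPT-3/OPT-175B scale) and are not taken from the paper:

```python
# Rough KV-cache size estimate (illustrative numbers, not from the paper).
# Per token and per layer, the cache holds one key and one value vector of size hidden_dim.
num_layers = 96          # assumed, roughly GPT-3/OPT-175B scale
hidden_dim = 12288       # assumed
bytes_per_elem = 2       # fp16

def kv_cache_bytes(seq_len: int, batch_size: int) -> int:
    # Factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * hidden_dim * bytes_per_elem * seq_len * batch_size

# A single 2048-token sequence already needs several gigabytes of KV cache:
print(f"{kv_cache_bytes(seq_len=2048, batch_size=1) / 1e9:.1f} GB")  # ~9.7 GB
```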
**Challenges:**
1. **Bimodal Latency:** Prompt processing is compute-bound, while token generation is memory-bandwidth-bound; interleaving the two in one pipeline creates bubbles (see the toy estimate after this list).
2. **GPU Memory Overprovisioning:** Systems preallocate GPU memory for all microbatches upfront, leading to underutilization.
3. **Fault Handling:** Failures in distributed setups cause all in-flight requests to stall, requiring restarts and increasing end-to-end latency.
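A toy estimate of the bimodal-latency problem (latencies below are assumed for illustration, not measured in the paper): in the worst case, a pipeline stage that has just finished a fast token step sits idle until the slow prompt step running on the adjacent stage completes.

```python
# Toy illustration of a pipeline bubble (assumed latencies, not from the paper):
# a stage finishing a fast token step must wait for the upstream stage that is
# still busy with a slow prompt (prefill) step for a different microbatch.
PROMPT_MS = 100.0   # assumed prefill step latency
TOKEN_MS = 10.0     # assumed per-token decode step latency

bubble_ms = PROMPT_MS - TOKEN_MS        # idle time of the token-processing stage
utilization = TOKEN_MS / PROMPT_MS      # fraction of that window it is busy
print(f"bubble: {bubble_ms:.0f} ms, token-stage utilization: {utilization:.0%}")
# bubble: 90 ms, token-stage utilization: 10%
```

The disaggregation described below targets exactly this interference.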
**Proposed Solutions:**
1. **Disaggregation:** Allocate separate machines for prompt processing and token generation to reduce pipeline bubbles.
2. **Microbatch Swapping:** Swap the KV cache between GPU and CPU memory at microbatch granularity, so only the microbatch currently being computed occupies GPU memory (see the sketch after this list).
3. **State Replication:** Replicate KV cache in persistent storage or remote CPU memory to handle failures and minimize recovery time.
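A minimal sketch of the microbatch-swapping idea, written in PyTorch purely for illustration (the paper's implementation builds on FasterTransformer, and none of the names or shapes below are DéjàVu's API): only the active microbatch keeps its KV cache on the GPU, while the others are parked in pinned host memory and copied back on a side stream before their turn.

```python
import torch

# Illustrative microbatch-level KV cache swapping (assumed shapes and names,
# not DéjàVu's actual implementation). The KV cache of each inactive microbatch
# lives in pinned CPU memory; it is streamed to the GPU right before its decode
# step and written back afterwards, so GPU memory holds one microbatch at a time.
num_microbatches, layers, tokens, hidden = 4, 4, 256, 1024
gpu = torch.device("cuda")

host_kv = [torch.empty(2, layers, tokens, hidden, dtype=torch.float16, pin_memory=True)
           for _ in range(num_microbatches)]      # K and V stacked along dim 0
copy_stream = torch.cuda.Stream()                 # side stream for host<->GPU copies

def fetch(mb: int) -> torch.Tensor:
    """Copy microbatch mb's KV cache to the GPU on the side stream."""
    with torch.cuda.stream(copy_stream):
        return host_kv[mb].to(gpu, non_blocking=True)

def evict(mb: int, gpu_kv: torch.Tensor) -> None:
    """Write microbatch mb's KV cache back to pinned host memory."""
    copy_stream.wait_stream(torch.cuda.current_stream())   # wait for the decode step
    with torch.cuda.stream(copy_stream):
        host_kv[mb].copy_(gpu_kv, non_blocking=True)

for step in range(8):                              # round-robin over microbatches
    mb = step % num_microbatches
    gpu_kv = fetch(mb)
    torch.cuda.current_stream().wait_stream(copy_stream)   # cache must be resident
    # ... run one decode step for microbatch mb here, updating gpu_kv in place ...
    evict(mb, gpu_kv)

torch.cuda.synchronize()                           # let pending copies finish
```

A real implementation would prefetch the next microbatch's cache while the current one is computing, so the copies overlap with GPU work instead of serializing with it.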
**DéjàVu System:**
DéjàVu is built on FasterTransformer and supports both tensor and pipeline parallelism. It implements the above optimizations on top of DéjàVuLib, a modular library that provides primitives for efficient KV cache streaming under a variety of hardware and deployment configurations.
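The summary above does not spell out DéjàVuLib's concrete API. As a rough sketch of the kind of primitives such a KV-cache streaming library needs, one might imagine an interface like the following; every name here is hypothetical and is not DéjàVuLib's actual API.

```python
# Hypothetical interface sketch: class and method names are NOT DéjàVuLib's real
# API, only an illustration of the primitives a KV-cache streaming library needs
# (flush a cache slice out of GPU memory, fetch it back, replicate it elsewhere).
from dataclasses import dataclass
from typing import Protocol

@dataclass
class KVCacheHandle:
    request_id: str      # request this cache slice belongs to
    layer: int           # transformer layer the slice covers
    num_tokens: int      # number of tokens cached so far

class KVCacheStream(Protocol):
    def flush(self, handle: KVCacheHandle, target: str) -> None:
        """Asynchronously push a KV-cache slice to a target: CPU DRAM, local
        SSD, or a remote machine (e.g. a token-processing worker or a replica)."""

    def fetch(self, handle: KVCacheHandle, source: str) -> None:
        """Bring a KV-cache slice back into GPU memory before it is needed."""

    def replicate(self, handle: KVCacheHandle, replicas: list[str]) -> None:
        """Keep copies of the slice on other machines so that a failed worker's
        requests can resume token generation without recomputing the prompt."""
```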
**Evaluation:**
DéjàVu improves LLM serving throughput by up to 2× compared to FasterTransformer, with microbatch swapping improving throughput by up to 1.8×. It also reduces microbatch latency by 1.54× in the presence of failures.
**Conclusion:**
DéjàVu is a comprehensive system for efficient and fault-tolerant LLM serving that addresses key challenges in distributed LLM inference. It offers prompt-token disaggregation, microbatch swapping, and KV cache replication, all built on the DéjàVuLib streaming library.