12 Sep 2023 | Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
The paper addresses the challenge of efficient memory management in serving large language models (LLMs), which requires batching many requests together to improve throughput. Existing systems struggle with the dynamic and large key-value cache (KV cache), leading to significant memory waste and limiting batch sizes. To tackle this, the authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging techniques in operating systems. PagedAttention divides the KV cache into non-contiguous, fixed-size blocks, allowing flexible and efficient memory management. Building on PagedAttention, they develop vLLM, a high-throughput distributed LLM serving engine that achieves near-zero waste in KV cache memory and flexible sharing of the KV cache within and across requests. Evaluations show that vLLM improves throughput by 2-4x compared to state-of-the-art systems such as FasterTransformer and Orca, with more pronounced gains for longer sequences, larger models, and more complex decoding algorithms. The source code for vLLM is publicly available.
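To make the block-table idea concrete, here is a minimal, illustrative sketch of how a paged KV cache might work: a fixed pool of physical blocks, a per-sequence table mapping logical block indices to physical block ids, and a gather step that reassembles a logically contiguous KV cache for attention. This is not vLLM's actual API; names such as `Sequence`, `append_kv`, `BLOCK_SIZE`, and the toy dimensions are assumptions made for this example.

```python
# Illustrative sketch of a paged KV cache (toy values, not vLLM's implementation).
import numpy as np

BLOCK_SIZE = 4          # tokens per KV block (assumed for this sketch)
NUM_BLOCKS = 16         # size of the physical KV-cache pool
HEAD_DIM = 8            # per-head hidden size (toy value)

# Physical pool: each block stores K and V for BLOCK_SIZE token positions.
k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
v_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(NUM_BLOCKS))


class Sequence:
    """Tracks one request's logical-to-physical block mapping (its block table)."""

    def __init__(self):
        self.block_table = []   # logical block i -> physical block id
        self.num_tokens = 0

    def append_kv(self, k, v):
        """Write one token's key/value into the next free slot, allocating a new
        (possibly non-contiguous) physical block whenever the last one fills up."""
        slot = self.num_tokens % BLOCK_SIZE
        if slot == 0:                        # current block is full (or none yet)
            self.block_table.append(free_blocks.pop(0))
        block_id = self.block_table[-1]
        k_pool[block_id, slot] = k
        v_pool[block_id, slot] = v
        self.num_tokens += 1

    def gather_kv(self):
        """Gather this sequence's K/V from scattered physical blocks for attention."""
        ks, vs = [], []
        for i, block_id in enumerate(self.block_table):
            n = min(BLOCK_SIZE, self.num_tokens - i * BLOCK_SIZE)
            ks.append(k_pool[block_id, :n])
            vs.append(v_pool[block_id, :n])
        return np.concatenate(ks), np.concatenate(vs)


# Usage: interleave two sequences; their blocks land in non-adjacent slots of the
# pool, yet each sequence still sees a logically contiguous KV cache.
seq_a, seq_b = Sequence(), Sequence()
for t in range(6):
    seq_a.append_kv(np.full(HEAD_DIM, t, np.float32), np.full(HEAD_DIM, t, np.float32))
    seq_b.append_kv(np.full(HEAD_DIM, -t, np.float32), np.full(HEAD_DIM, -t, np.float32))

k_a, v_a = seq_a.gather_kv()
print("seq_a physical blocks:", seq_a.block_table)   # [0, 2] -- not adjacent
print("seq_a K shape:", k_a.shape)                   # (6, 8)
```

Because only whole blocks are allocated, at most one partially filled block per sequence is wasted, which is what lets vLLM approach near-zero KV-cache waste; sharing a prefix across requests then amounts to pointing multiple block tables at the same physical blocks.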