Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica | SOSP '23, October 23-26, 2023, Koblenz, Germany
This paper presents PagedAttention, an attention algorithm inspired by virtual memory and paging techniques in operating systems, and vLLM, an LLM serving system that achieves near-zero waste in KV cache memory and flexible sharing of KV cache within and across requests. PagedAttention divides the KV cache into blocks, each containing the attention keys and values of a fixed number of tokens. This allows more flexible memory management, reducing internal and external fragmentation. vLLM uses block-level memory management and preemptive request scheduling to achieve high throughput. Evaluations show that vLLM improves the throughput of popular LLMs by 2-4× compared to state-of-the-art systems like FasterTransformer and Orca, with improvements more pronounced for longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.
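To make the block layout concrete, here is a minimal NumPy sketch (not the paper's CUDA kernel) showing that attention computed block by block over fixed-size KV blocks matches attention over one contiguous KV cache; the block size, head size, and token count are illustrative choices, not values from the paper.

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV block (illustrative)
HEAD_SIZE = 64
NUM_TOKENS = 40   # not a multiple of BLOCK_SIZE: the last block is partially filled

rng = np.random.default_rng(0)
q = rng.standard_normal(HEAD_SIZE)
keys = rng.standard_normal((NUM_TOKENS, HEAD_SIZE))
values = rng.standard_normal((NUM_TOKENS, HEAD_SIZE))

# Reference: softmax attention over a single contiguous KV cache.
scores = keys @ q / np.sqrt(HEAD_SIZE)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
reference = weights @ values

# Paged layout: the same K/V entries split into fixed-size blocks that could
# live anywhere in memory; scores and outputs are accumulated block by block.
key_blocks = [keys[i:i + BLOCK_SIZE] for i in range(0, NUM_TOKENS, BLOCK_SIZE)]
value_blocks = [values[i:i + BLOCK_SIZE] for i in range(0, NUM_TOKENS, BLOCK_SIZE)]

block_scores = np.concatenate([kb @ q / np.sqrt(HEAD_SIZE) for kb in key_blocks])
block_weights = np.exp(block_scores - block_scores.max())
block_weights /= block_weights.sum()

offsets = np.cumsum([len(vb) for vb in value_blocks])[:-1]
paged = sum(w @ vb for w, vb in zip(np.split(block_weights, offsets), value_blocks))

assert np.allclose(reference, paged)
```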
LLMs require careful memory management because each request needs a large key-value (KV) cache that grows and shrinks dynamically as tokens are generated. Existing systems store this cache in contiguous buffers pre-allocated for the maximum possible sequence length, which causes internal and external fragmentation and rules out memory sharing. PagedAttention addresses this by allowing the KV cache to be stored in non-contiguous, fixed-size blocks, reducing fragmentation and enabling sharing. vLLM builds on PagedAttention to manage the KV cache in a paged manner, turning the reclaimed memory into larger batches and higher throughput.
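The sketch below is an assumed illustration of the block-table bookkeeping this implies, not vLLM's actual memory manager: each sequence maps its logical KV blocks to physical blocks that need not be contiguous, and a new physical block is allocated only when the previous one fills up, so at most one block per sequence sits partially empty. Class and variable names are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    """Hands out physical KV block ids from a fixed pool."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks: preempt or swap a request")
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    """Tracks one request's block table: logical index -> physical block id."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so memory waste is bounded by one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=8)
seq = Sequence(allocator)
for _ in range(40):        # decode 40 tokens
    seq.append_token()
print(seq.block_table)     # three physical block ids; need not be contiguous
```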
The paper evaluates vLLM on various models and workloads and shows that it improves LLM serving throughput by 2-4× over state-of-the-art systems at comparable latency, without affecting model accuracy. The gains grow with longer sequences, larger models, and more complex decoding algorithms, because the reduced memory waste lets vLLM batch more requests in parallel.
The paper also discusses the main challenges of memory management in LLM serving: the large and variable KV cache size, decoding algorithms with different sharing patterns, and scheduling requests whose input and output lengths are unknown in advance. vLLM addresses these with PagedAttention, block-level memory management, and preemptive scheduling that swaps out or recomputes KV blocks under memory pressure. The system handles various decoding methods, including parallel sampling, beam search, and shared-prefix prompts, by sharing KV blocks across sequences with copy-on-write, avoiding redundant memory and computation.
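As a rough illustration of that sharing mechanism (an assumed sketch, not vLLM's implementation), the reference-counted, copy-on-write block manager below lets sequences forked from the same prompt, such as parallel samples, share physical blocks until one of them writes into a block that others still reference.

```python
from collections import defaultdict

class CowBlockManager:
    """Copy-on-write sharing of physical KV blocks via reference counts."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = defaultdict(int)

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        # A forked sequence reuses the parent's physical blocks.
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def write(self, block_table: list[int], logical_idx: int) -> None:
        block = block_table[logical_idx]
        if self.ref_count[block] > 1:
            # Copy-on-write: give this sequence a private copy of the block.
            self.ref_count[block] -= 1
            block_table[logical_idx] = self.allocate()

mgr = CowBlockManager(num_blocks=8)
parent = [mgr.allocate(), mgr.allocate()]   # prompt KV fills two blocks
child = mgr.fork(parent)                    # parallel sample shares both blocks
mgr.write(child, 1)                         # child appends into the last block
assert child[0] == parent[0] and child[1] != parent[1]
```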
vLLM is implemented as an end-to-end serving system with a FastAPI frontend and a GPU-based inference engine. It supports the decoding algorithms described above and is optimized with custom kernels for PagedAttention (e.g., fused block read and attention) alongside block-level memory management. Evaluated on various models and workloads, it delivers significant improvements in throughput and memory efficiency over existing systems.
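For reference, here is a short usage sketch against vLLM's public offline-inference Python API; the model name and sampling parameters are illustrative, and the exact API surface may vary across versions.

```python
from vllm import LLM, SamplingParams

prompts = ["Paged attention lets an LLM server"]
# n=2 requests two parallel samples per prompt, which can share the prompt's KV blocks.
sampling_params = SamplingParams(n=2, temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-13b")          # model choice is illustrative
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    for completion in output.outputs:
        print(completion.text)
```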