17 Jun 2024 | Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee
Sarathi-Serve is an efficient LLM inference scheduler that addresses the throughput-latency tradeoff in serving large language models. Inference proceeds in two phases: prefill and decode. Prefill processes all prompt tokens in parallel, so its iterations are long but keep the GPU well utilized. Decode generates output tokens one at a time, so each iteration is short but leaves the GPU underutilized. Batching improves decode throughput, but interleaving long prefill iterations with short decode iterations stalls ongoing decodes, making it hard to achieve high throughput and low latency at the same time.
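To make the asymmetry concrete, here is a minimal, illustrative sketch (not code from the paper; `dummy_forward` and the toy prompt are hypothetical stand-ins for a transformer forward pass): prefill pushes the whole prompt through one forward pass, while decode needs a separate forward pass per generated token.

```python
# Toy illustration of the two inference phases (assumed/hypothetical code,
# not from Sarathi-Serve). The "model" here is a stub whose cost you can
# think of as proportional to the number of tokens it is given.

def dummy_forward(token_batch):
    """Pretend forward pass: returns one 'next token' per sequence."""
    return [f"<tok_{len(seq)}>" for seq in token_batch]

prompt = ["The", "quick", "brown", "fox"]   # 4-token prompt
max_new_tokens = 3

# Prefill: the entire prompt is processed in a single, highly parallel pass,
# so the iteration is long but the GPU stays busy.
context = list(prompt)                      # stands in for the KV cache
first_token = dummy_forward([context])[0]   # 4 tokens -> 1 forward pass
print(f"prefill: {len(prompt)} tokens in one pass -> {first_token}")

# Decode: every subsequent token needs its own forward pass over one token,
# so iterations are short but leave the GPU underutilized.
generated = [first_token]
for _ in range(max_new_tokens - 1):
    context.append(generated[-1])
    next_token = dummy_forward([[generated[-1]]])[0]  # 1 token -> 1 forward pass
    generated.append(next_token)
print(f"decode: {len(generated)} tokens, one forward pass each")
```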
Sarathi-Serve introduces chunked-prefills, which split a prefill request into smaller chunks, and stall-free scheduling, which lets new requests join a running batch without pausing ongoing decodes. Together, these techniques sustain high throughput while minimizing the latency impact of batching. Because chunked prefills keep the work per iteration roughly uniform, they also reduce pipeline bubbles, improving GPU utilization and enabling efficient, scalable deployments. A simplified scheduling sketch follows.
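The sketch below is a hypothetical rendering of stall-free batching with chunked prefills (the request names, token budget, and data structures are illustrative assumptions, not Sarathi-Serve's actual implementation): each iteration admits every ongoing decode first, then spends the leftover token budget on a chunk of a pending prefill rather than the whole prompt.

```python
# Minimal sketch of stall-free batching with chunked prefills (assumed names
# and budget, not Sarathi-Serve's real code). Ongoing decodes are never
# paused; prefills are admitted chunk by chunk within a per-iteration budget.

from collections import deque

TOKEN_BUDGET = 8   # max tokens processed per iteration (illustrative value)

decodes = ["reqA", "reqB", "reqC"]                 # ongoing decode requests
prefills = deque([("reqD", 20), ("reqE", 12)])     # (request, remaining prompt tokens)

for step in range(1, 6):
    batch, budget = [], TOKEN_BUDGET

    # Decodes go first: each contributes exactly one token to the batch.
    for req in decodes:
        batch.append((req, "decode", 1))
        budget -= 1

    # Spend the leftover budget on a chunk of the next pending prefill,
    # instead of admitting the full prompt and stalling the decodes.
    if prefills and budget > 0:
        req, remaining = prefills[0]
        chunk = min(budget, remaining)
        batch.append((req, "prefill-chunk", chunk))
        remaining -= chunk
        if remaining == 0:
            prefills.popleft()
            decodes.append(req)        # finished prefill joins the decode pool
        else:
            prefills[0] = (req, remaining)

    print(f"iteration {step}: {batch}")
```

Since every iteration processes roughly TOKEN_BUDGET tokens, per-iteration compute stays close to uniform, which is what keeps pipeline stages balanced and bubbles small.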
Sarathi-Serve achieves significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on a single A100 GPU, Sarathi-Serve achieves 2.6× higher serving capacity than vLLM. For Yi-34B on two A100 GPUs, it achieves up to 3.7× higher serving capacity. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to a 5.6× gain in end-to-end serving capacity.
The main contributions of this paper include identifying pitfalls in current LLM serving systems, introducing chunked-prefills and stall-free batching to improve performance, and demonstrating the generality of Sarathi-Serve through extensive evaluation across multiple models, hardware, and parallelism strategies. Sarathi-Serve improves model serving capacity by up to an order of magnitude.