17 Jun 2024 | Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee
The paper "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve" addresses the challenge of balancing throughput and latency in large language model (LLM) inference. The authors introduce Sarathi-Serve, an efficient LLM inference scheduler that introduces *chunked-prefills* and *stall-free scheduling* to improve performance.
**Key Contributions:**
1. **Chunked-Prefills:** Splits prefill requests into smaller, near-equal-sized chunks that can be scheduled across multiple iterations while keeping GPU compute well utilized.
2. **Stall-Free Scheduling:** Lets new requests join a running batch without pausing ongoing decodes, reducing latency spikes while preserving throughput (see the scheduling sketch after this list).
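The mechanics of these two ideas are easier to see in code. The Python sketch below is *not* Sarathi-Serve's implementation; `Request`, `build_batch`, and `token_budget` are hypothetical names, and real concerns such as KV-cache memory limits, chunk alignment with attention kernels, and admission policy are omitted. Under those simplifying assumptions, it shows how a fixed per-iteration token budget lets prefill chunks share a batch with ongoing decodes instead of displacing them.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int          # total number of prefill (prompt) tokens
    prefill_done: int = 0    # prefill tokens already processed
    decoded: int = 0         # decode tokens emitted so far

    @property
    def in_decode(self) -> bool:
        return self.prefill_done >= self.prompt_len


def build_batch(running, waiting, token_budget):
    """Form one iteration's hybrid batch under a fixed token budget."""
    batch = []               # list of (request, tokens scheduled this iteration)
    budget = token_budget

    # 1. Admit every ongoing decode first (one token each), so decodes are
    #    never paused by an incoming prefill -- no generation stall.
    for req in running:
        if req.in_decode and budget > 0:
            batch.append((req, 1))
            budget -= 1

    # 2. Spend whatever budget is left on prefill chunks, splitting long
    #    prompts across iterations (chunked-prefills).
    for req in running + waiting:
        if budget == 0:
            break
        if not req.in_decode:
            chunk = min(budget, req.prompt_len - req.prefill_done)
            batch.append((req, chunk))
            budget -= chunk

    return batch


# Example: a long 8192-token prompt arrives while 4 requests are decoding.
running = [Request(prompt_len=128, prefill_done=128, decoded=10) for _ in range(4)]
waiting = [Request(prompt_len=8192)]
batch = build_batch(running, waiting, token_budget=2048)
# -> 4 decode tokens plus a 2044-token prefill chunk; decodes keep advancing.
```

Because every iteration's work is capped by the token budget, iteration times stay roughly uniform, which is also what keeps pipeline-parallel stages from idling.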
**Motivation:**
- **Throughput-Latency Tradeoff:** Current LLM serving systems struggle to balance throughput and latency because the prefill and decode phases behave differently: batching boosts decode throughput (decodes are memory-bound) but has minimal effect on prefill throughput (prefills already saturate GPU compute).
- **Generation Stalls:** Prefill-prioritizing schedulers introduce generation stalls, where a long prefill delays all ongoing decodes, leading to high tail latency (see the toy example after this list).
- **Pipeline Bubbles:** Pipeline-parallelism introduces pipeline bubbles, causing GPU inactivity and reducing overall system throughput.
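To make the generation-stall issue concrete, here is a toy back-of-the-envelope comparison. The prompt length and token budget are hypothetical values chosen only for illustration, and iteration cost is counted in tokens rather than measured latency:

```python
# Toy illustration: iteration cost counted in tokens, not measured latency.
PROMPT_LEN = 8192      # hypothetical long prompt arriving mid-stream
TOKEN_BUDGET = 2048    # hypothetical per-iteration token budget

# Prefill-prioritizing scheduler: the whole prompt is prefilled in a single
# iteration whose cost grows with PROMPT_LEN, and ongoing decodes are paused
# for that entire iteration -- a generation stall and a tail-latency spike.
print(f"prefill-prioritizing: 1 iteration of {PROMPT_LEN} prefill tokens, decodes stalled")

# Stall-free scheduling with chunked-prefills: the prompt is split into
# budget-sized chunks coalesced with the running decodes, so every iteration
# is bounded by TOKEN_BUDGET and decodes advance in every iteration.
num_chunks = -(-PROMPT_LEN // TOKEN_BUDGET)   # ceil(8192 / 2048) = 4
print(f"stall-free: {num_chunks} iterations of <= {TOKEN_BUDGET} tokens each, decodes never stall")
```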
**Evaluation:**
- **Models and Hardware:** Evaluated on Mistral-7B, Yi-34B, LLaMA2-70B, and Falcon-180B, across hardware configurations that include NVIDIA A100 and A40 GPUs.
- **Throughput and Latency:** Compared to vLLM, Sarathi-Serve achieved up to 2.6× higher serving capacity for Mistral-7B on a single A100 GPU and up to 3.7× higher serving capacity for Yi-34B on two A100 GPUs. For Falcon-180B served with combined pipeline and tensor parallelism, it delivered up to a 5.6× gain in end-to-end serving capacity.
**Conclusion:**
Sarathi-Serve effectively balances throughput and latency by leveraging chunked-prefills and stall-free scheduling, improving LLM inference performance across various models and hardware configurations.