**DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving**
**Authors:** Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
**Institution:** School of Computer Science, Peking University; StepFun; UC San Diego
**Abstract:**
DistServe improves the serving performance of large language models (LLMs) by disaggregating the prefill and decoding computation. Existing LLM serving systems co-locate the two phases and batch their computation across all requests, causing strong prefill-decoding interference and coupling their resource allocation. DistServe assigns prefill and decoding to different GPUs, eliminating this interference and allowing the resource allocation and parallelism strategy of each phase to be optimized independently. The system also chooses placements that minimize the communication overhead of disaggregation. Evaluations show that DistServe can serve up to 7.4× more requests or meet 12.6× tighter SLOs than state-of-the-art systems, while keeping over 90% of requests within their latency constraints.
**Introduction:**
LLMs such as GPT-4, Bard, and LLaMA are reshaping Internet services and enabling new applications. However, processing an LLM query is far more expensive than a traditional keyword search, so providers must over-provision compute resources to meet latency requirements. DistServe addresses this by separating the prefill and decoding phases and tailoring resource allocation and parallelism to each. This separation removes cross-phase interference and allows the two phases to scale independently, improving per-GPU goodput, i.e., the request rate each GPU can serve while meeting the latency SLOs.
**Background and Motivation:**
LLM services follow a client-server architecture: the server hosts the LLM on GPUs and runs inference in two phases, a prefill phase that processes the whole prompt and emits the first token, and a decoding phase that generates the remaining tokens one at a time. The service must meet stringent latency requirements on both the time to first token (TTFT) and the time per output token (TPOT). Existing systems co-locate the two phases and batch them together to maximize throughput, but this couples their resource allocation and forces a trade-off between TTFT and TPOT.
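To make these two metrics concrete, here is a minimal sketch (the helper names and trace format are hypothetical, not code from the paper) that computes TTFT, TPOT, and SLO attainment from per-request token timestamps:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float             # request arrival time (seconds)
    token_times: list[float]   # completion time of each output token

def ttft(r: RequestTrace) -> float:
    """Time to first token: delay from arrival to the first output token."""
    return r.token_times[0] - r.arrival

def tpot(r: RequestTrace) -> float:
    """Time per output token: average gap between subsequent tokens."""
    gaps = [b - a for a, b in zip(r.token_times, r.token_times[1:])]
    return sum(gaps) / len(gaps)

def slo_attainment(traces: list[RequestTrace],
                   ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests that meet both latency SLOs."""
    ok = sum(1 for r in traces if ttft(r) <= ttft_slo and tpot(r) <= tpot_slo)
    return ok / len(traces)
```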
**Disaggregation:**
Disaggregating the prefill and decoding phases lets each phase be provisioned and tuned for its own latency target. It enables independent choices of resource allocation and parallelism strategy per phase, maximizing per-GPU goodput, at the cost of transferring each request's KV cache from its prefill instance to its decoding instance.
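The request flow after disaggregation can be sketched as follows; the pool and worker interfaces here are illustrative assumptions, not DistServe's actual API:

```python
def serve_disaggregated(request, prefill_pool, decode_pool):
    """Simplified sketch of a disaggregated serving path."""
    # 1. Prefill: one compute-bound pass over the full prompt, producing
    #    the first output token and the request's KV cache.
    prefill_worker = prefill_pool.pick_least_loaded()
    first_token, kv_cache = prefill_worker.prefill(request.prompt)

    # 2. Migrate the KV cache to a decoding instance (cheap when both
    #    instances share a node and the transfer uses intra-node bandwidth).
    decode_worker = decode_pool.pick_least_loaded()
    decode_worker.load_kv_cache(request.id, kv_cache)

    # 3. Decode: memory-bound, one token per iteration, batched with other
    #    requests without disturbing prefill latency on other GPUs.
    yield first_token
    yield from decode_worker.decode(request.id, first_token)
```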
**Tradeoff Analysis:**
Analyzing the prefill and decoding phases after disaggregation yields insights into the best batching and parallelism strategy for each. Prefill is compute-bound, so batching helps only until the GPU saturates; the goal for prefill instances is to meet the TTFT requirement with the fewest resources. Decoding is memory-bound and benefits from larger batch sizes; the goal for decoding instances is likewise to meet the TPOT requirement with the fewest resources.
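One way to quantify the prefill side, in the spirit of the paper's queuing analysis, is to model a prefill instance as an M/D/1 queue (an assumption here, with D the per-request prefill execution time and R the request arrival rate):

```python
def avg_ttft_md1(rate: float, prefill_time: float) -> float:
    """Average TTFT for a prefill instance modeled as an M/D/1 queue:
    execution time D plus expected queueing delay R*D^2 / (2*(1 - R*D)).
    Valid only when the instance is stable, i.e. rate * prefill_time < 1."""
    rho = rate * prefill_time
    assert rho < 1, "arrival rate exceeds prefill capacity"
    return prefill_time + rate * prefill_time**2 / (2 * (1 - rho))
```

Under this model, average TTFT blows up as utilization approaches 1, which is why a prefill instance must be provisioned with headroom rather than run at peak throughput.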
**Method:**
DistServe determines the parallelism strategies, instance counts, and placement for the prefill and decoding instances. It uses a simulator to estimate SLO attainment and optimizes placement to maximize per-GPU goodput. Online scheduling optimizations further enhance performance.
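A brute-force sketch of such a placement search follows; the search space, constraint, and `simulate_goodput` function are simplified stand-ins for the paper's algorithm, not its actual implementation:

```python
import itertools

def search_placement(workload, gpus_per_node, simulate_goodput):
    """Enumerate tensor/pipeline parallelism configs for prefill and decode
    instances, estimate goodput with a simulator, and keep the config with
    the best per-GPU goodput."""
    best, best_goodput = None, 0.0
    degrees = [1, 2, 4, 8]
    for p_tp, p_pp, d_tp, d_pp in itertools.product(degrees, repeat=4):
        # Keep each instance pair within a single node so KV-cache
        # transfers can use fast intra-node bandwidth.
        if p_tp * p_pp > gpus_per_node or d_tp * d_pp > gpus_per_node:
            continue
        goodput = simulate_goodput(workload, (p_tp, p_pp), (d_tp, d_pp))
        per_gpu = goodput / (p_tp * p_pp + d_tp * d_pp)
        if per_gpu > best_goodput:
            best, best_goodput = ((p_tp, p_pp), (d_tp, d_pp)), per_gpu
    return best, best_goodput
```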
**Evaluation:**
DistServe is evaluated on various LLMs and applications, showing significant improvements in sustainable request rate and SLO attainment over the baselines. The system remains robust as latency requirements change, and the KV-cache transfer overhead introduced by disaggregation stays small under careful placement.