DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

6 Jun 2024 | Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
DistServe is a large language model (LLM) serving system that improves performance by disaggregating the prefill and decoding phases. LLM services process each query in two phases: prefill, which processes the prompt and generates the first token, and decoding, which generates the subsequent tokens one at a time. Existing systems colocate the two phases on the same GPUs, which causes prefill-decoding interference and couples their resource allocation.

DistServe assigns each phase to separate GPUs, eliminating this interference and allowing the resource allocation and parallelism strategy to be tailored to each phase's latency requirement: time to first token (TTFT) for prefill and time per output token (TPOT) for decoding. A placement algorithm assigns phases to GPUs based on the cluster's bandwidth to minimize the overhead of transferring intermediate state between them, and the system scales through model parallelism and replication.

DistServe is implemented as an orchestration layer on top of an LLM inference engine and supports a variety of LLMs and workloads. By maximizing goodput, the rate of requests served within TTFT and TPOT constraints, it can serve 7.4× more requests or meet 12.6× tighter SLOs than state-of-the-art systems, while keeping latency within the constraints for over 90% of requests.
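The goodput objective described above can be made concrete with a small sketch. This is a hedged illustration, not DistServe's actual code: the data classes, function names, and the 90% attainment target used here are assumptions for the example; the paper's evaluation reports SLO attainment for over 90% of requests.

```python
# Illustrative sketch: goodput counts a request rate only when enough
# requests meet both latency SLOs (TTFT and TPOT).
from dataclasses import dataclass


@dataclass
class RequestLatency:
    ttft: float  # time to first token, in seconds
    tpot: float  # average time per output token, in seconds


def slo_attainment(requests, ttft_slo, tpot_slo):
    """Fraction of requests meeting BOTH the TTFT and TPOT SLOs."""
    met = sum(1 for r in requests if r.ttft <= ttft_slo and r.tpot <= tpot_slo)
    return met / len(requests)


def goodput(requests, request_rate, ttft_slo, tpot_slo, target=0.9):
    """A request rate counts as goodput only if >= `target` of requests meet the SLOs."""
    ok = slo_attainment(requests, ttft_slo, tpot_slo) >= target
    return request_rate if ok else 0.0
```

A colocated system that hits high raw throughput but misses TTFT on many requests scores zero under this metric, which is why optimizing goodput rather than throughput motivates the disaggregated design.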
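The prefill/decoding split itself can be sketched as a handoff between two worker roles. Everything below is a toy stand-in, not DistServe's API: the "model" is fake arithmetic, and the point is only the structure, where a prefill instance processes the whole prompt in one pass and returns the first token plus the accumulated KV cache, which is then transferred to a separate decoding instance that generates the remaining tokens one step at a time.

```python
# Toy sketch of disaggregated serving: separate prefill and decode roles
# with an explicit KV-cache handoff between them. Token values and the
# KV cache are stand-ins for real model state.

def prefill(prompt_tokens):
    """Runs on a prefill GPU: process the full prompt in a single pass."""
    kv_cache = list(prompt_tokens)    # stand-in for per-token attention state
    first_token = len(prompt_tokens)  # stand-in for the sampled first token
    return first_token, kv_cache


def decode(kv_cache, first_token, max_new_tokens):
    """Runs on a decode GPU: autoregressive generation, one token per step."""
    out = [first_token]
    for _ in range(max_new_tokens - 1):
        kv_cache.append(out[-1])      # the KV cache grows by one entry per step
        out.append(out[-1] + 1)       # stand-in for sampling the next token
    return out


first, cache = prefill([10, 20, 30])
tokens = decode(cache, first, 4)
```

Because the two roles run on different GPUs, the compute-heavy prefill pass no longer stalls the latency-sensitive decode steps of other requests, at the cost of shipping the KV cache across the interconnect, which is the overhead the placement algorithm minimizes.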