LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

15 Apr 2024 | Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin
LoongServe is an efficient LLM serving system that addresses the challenges of serving long-context LLMs by introducing Elastic Sequence Parallelism (ESP). Static parallelism strategies are inefficient for workloads whose request lengths vary widely and whose prefill and decoding phases have very different resource demands. ESP instead adjusts the degree of parallelism (DoP) dynamically at runtime, improving computation, communication, and GPU memory efficiency. On real-world datasets, LoongServe achieves up to 3.85× higher throughput than chunked prefill and up to 5.81× higher throughput than prefill-decoding disaggregation.

ESP scales parallelism elastically to match request lengths and phases. For the prefill phase, proactive scaling-down reuses communication already performed during prefill to reduce scaling overhead. For the decoding phase, multi-master decoding avoids key-value (KV) cache migration and overlaps communication with computation. Together, these mechanisms eliminate GPU memory fragmentation and enable token-level KV cache allocation.

LoongServe's global manager dynamically adjusts the DoP, batching, and KV cache placement based on real-time profiling. It makes decisions at the iteration level with a scalable four-step scheduling algorithm of polynomial complexity, and it supports elastic scaling up and down without additional overhead, ensuring efficient resource utilization. The global manager coordinates requests, elastic instances, and the distributed KV cache pool.
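
To make iteration-level DoP selection concrete, here is a minimal Python sketch assuming a simple length-based heuristic: long prompts spread across more elastic instances during prefill, and requests drop to a small DoP for decoding. All names (Request, choose_dop) and the tokens_per_instance threshold are illustrative assumptions, not LoongServe's actual API.

```python
# Hypothetical sketch of iteration-level DoP selection under ESP.
# Request, choose_dop, and tokens_per_instance are illustrative, not
# LoongServe's actual API.
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    prompt_len: int         # tokens to prefill
    phase: str = "prefill"  # "prefill" or "decoding"


def choose_dop(req: Request, free_instances: int,
               tokens_per_instance: int = 16384) -> int:
    """Pick a degree of parallelism for the next iteration.

    Prefill cost grows with prompt length, so long prompts get more
    instances; decoding produces one token per iteration, so it runs at
    a small DoP after proactively scaling down. The threshold is an
    assumed tuning knob, not a value from the paper.
    """
    if req.phase == "prefill":
        needed = -(-req.prompt_len // tokens_per_instance)  # ceil division
        return max(1, min(needed, free_instances))
    return 1  # decoding: scale down and free instances for other requests


if __name__ == "__main__":
    print(choose_dop(Request(0, prompt_len=2_000), free_instances=8))    # 1
    print(choose_dop(Request(1, prompt_len=200_000), free_instances=8))  # 8
```

In a real scheduler this choice would be revisited every iteration as instances free up, which is what makes the parallelism elastic rather than static.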
It optimizes scheduling under GPU compute and memory constraints, balancing the prefill and decoding phases, and uses dynamic programming to choose batching strategies that minimize prefill latency. LoongServe is implemented in C++, CUDA, Python, and OpenAI Triton, reusing components from LightLLM and vLLM. It supports multiple dynamic parallel groups and uses NCCL for communication. The system is compatible with various attention mechanisms and, evaluated on real-world workloads, outperforms state-of-the-art serving systems.
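
As an illustration of the batching step, here is a minimal dynamic-programming sketch that partitions an ordered queue of prefill requests into contiguous batches minimizing a summed latency cost. The quadratic cost model and all names are assumptions for illustration, not the paper's actual algorithm or cost model.

```python
# Hypothetical dynamic-programming batch planner for queued prefill
# requests. The cost model and all names are assumptions for
# illustration, not LoongServe's actual scheduler.

def batch_cost(token_counts: list[int]) -> float:
    """Assumed latency model: a fixed per-batch overhead plus a term that
    is quadratic in the batch's total tokens (attention over the batch)."""
    total = sum(token_counts)
    return 5.0 + 1e-6 * total * total


def plan_batches(token_counts: list[int]) -> list[list[int]]:
    """Split an ordered queue of prefill requests into contiguous batches
    minimizing the summed batch cost, via an O(n^2) DP."""
    n = len(token_counts)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of first i requests
    cut = [0] * (n + 1)                # cut[i]: start of the batch ending at i
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + batch_cost(token_counts[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    batches, i = [], n
    while i > 0:                       # walk the cuts back to recover batches
        batches.append(token_counts[cut[i]:i])
        i = cut[i]
    return batches[::-1]


if __name__ == "__main__":
    # Adjacent short prompts get merged; long prompts run in their own batch:
    # [[1000, 1200], [64000], [500], [128000]]
    print(plan_batches([1_000, 1_200, 64_000, 500, 128_000]))
```

The per-batch overhead makes it worthwhile to merge short prompts, while the quadratic term keeps very long prompts from being batched with anything else.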
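Finally, the token-level KV cache allocation mentioned above can be pictured as a shared pool of per-token slots. Below is a minimal sketch assuming a simple free-list design; TokenKVPool and all names are hypothetical, not LoongServe's implementation.

```python
# Hypothetical sketch of a token-granularity KV cache pool, assuming a
# simple free-list of per-token slots. TokenKVPool and its methods are
# illustrative, not LoongServe's implementation.

class TokenKVPool:
    """Allocates KV cache space one token slot at a time, so any free slot
    can serve any request: no contiguous per-request region is reserved,
    which avoids fragmentation as requests of varying lengths come and go."""

    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))     # indices of free token slots
        self.owner: dict[int, list[int]] = {}  # req_id -> slots it holds

    def alloc(self, req_id: int, num_tokens: int) -> list[int]:
        if num_tokens > len(self.free):
            raise MemoryError("KV cache pool exhausted")
        slots = [self.free.pop() for _ in range(num_tokens)]
        self.owner.setdefault(req_id, []).extend(slots)
        return slots

    def release(self, req_id: int) -> None:
        self.free.extend(self.owner.pop(req_id, []))


if __name__ == "__main__":
    pool = TokenKVPool(num_slots=1 << 20)
    pool.alloc(req_id=0, num_tokens=4_096)  # prefill: one slot per prompt token
    pool.alloc(req_id=0, num_tokens=1)      # decoding: one slot per new token
    pool.release(0)                         # all slots return to the pool
```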