Llumnix: Dynamic Scheduling for Large Language Model Serving

5 Jun 2024 | Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin
Llumnix is a dynamic scheduling system for large language model (LLM) serving that targets heterogeneous and unpredictable workloads. Because LLMs power diverse applications, requests vary widely in input/output length, expected latency, and resource demand. Existing serving systems handle this variability poorly, leading to severe queuing delays, poor tail latencies, and SLO violations.

Llumnix addresses these problems, along with isolation, memory fragmentation, and prioritization, by rescheduling requests across multiple model instances at runtime. Its key mechanism is an efficient live migration of requests and their in-memory states, which keeps rescheduling cheap enough to perform continuously and with little downtime. On top of this mechanism, a distributed scheduling architecture and a dynamic scheduling policy based on virtual usage unify the different rescheduling goals (load balancing, isolation, de-fragmentation, and prioritization) under a single load metric.

Llumnix supports vLLM as its underlying inference engine. Evaluated on a 16-GPU cluster, it improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5×, and achieves up to 36% cost savings while maintaining tail latencies similar to state-of-the-art systems, with significant gains in both prefill and decode latencies.

Llumnix is publicly available at https://github.com/AlibabaPAI/llumnix.
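To make the virtual-usage idea concrete, below is a minimal sketch, not Llumnix's actual API: the class and function names (`Instance`, `pick_migration`) and the thresholds are invented for illustration. It shows how adding "virtual" KV-cache usage to real usage yields one load metric that can drive ordinary load balancing and, at the same time, express other goals such as reserving headroom for high-priority requests or draining an instance before scale-in.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Instance:
    """KV-cache accounting for one model instance (hypothetical names)."""
    name: str
    total_blocks: int        # KV-cache blocks the instance can hold
    used_blocks: int = 0     # blocks held by running requests
    virtual_blocks: int = 0  # "virtual usage": e.g., headroom reserved for
                             # high-priority requests, or total_blocks to
                             # drain the instance before terminating it

    @property
    def load(self) -> float:
        # Virtual usage is simply added to real usage, so a single metric
        # expresses balancing, prioritization, and draining at once.
        return (self.used_blocks + self.virtual_blocks) / self.total_blocks

def pick_migration(instances: List[Instance],
                   high: float = 0.8,
                   gap: float = 0.2) -> Optional[Tuple[Instance, Instance]]:
    """Pair the most- and least-loaded instances if rebalancing is worthwhile."""
    ranked = sorted(instances, key=lambda inst: inst.load)
    src, dst = ranked[-1], ranked[0]
    if src.load > high and src.load - dst.load > gap:
        return src, dst  # live-migrate a request (and its KV cache) src -> dst
    return None

# Example: an instance being drained advertises full virtual usage, so the
# policy naturally migrates its requests away even though its real load is low.
instances = [
    Instance("inst-0", total_blocks=100, used_blocks=40),
    Instance("inst-1", total_blocks=100, used_blocks=30, virtual_blocks=100),
]
move = pick_migration(instances)
if move:
    print(f"migrate a request from {move[0].name} to {move[1].name}")
```

Llumnix's actual policy is richer (it also accounts for queuing requests and fragmentation); the sketch only illustrates why folding every rescheduling goal into one additive metric lets a single migration loop serve all of them.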