5 Jun 2024 | Biao Sun*,†, Ziming Huang*,†, Hanyu Zhao*, Wencong Xiao, Xinyi Zhang†, Yong Li, Wei Lin
**Llumnix: Dynamic Scheduling for Large Language Model Serving**
**Authors:** Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin
**Affiliation:** Alibaba Group
**Abstract:**
Inference serving for large language models (LLMs) is crucial for their practical applications, but efficient serving remains challenging due to the heterogeneous and unpredictable nature of requests. Existing systems suffer from severe queuing delays, poor tail latencies, and SLO violations. Llumnix addresses these issues with a runtime rescheduling mechanism that moves requests across multiple model instances. Analogous to context switching in modern operating systems, this rescheduling improves load balancing and isolation and reduces resource fragmentation. Llumnix employs an efficient live migration mechanism for requests and their in-memory states, achieving near-zero downtime, and a dynamic scheduling policy that unifies the various rescheduling scenarios. Evaluations show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5×, and achieves 36% cost savings while maintaining similar tail latencies compared to state-of-the-art systems.
**Key Contributions:**
- Reveals unique characteristics and scheduling challenges of LLM serving.
- Proposes request rescheduling as a key measure.
- Designs a distributed scheduling architecture and a heuristic scheduling policy.
- Implements and evaluates Llumnix, demonstrating its advantages over state-of-the-art systems.
**Motivation:**
- Unpredictable memory demands and preemptions.
- Performance interference among requests.
- Memory fragmentation.
- Differing urgency and priority levels among requests.
**Design:**
- **Live Migration:** Exploits the append-only nature of the KV cache to migrate requests with near-zero downtime.
- **Distributed Scheduling Architecture:** Combines global and local schedulers for efficient continuous rescheduling.
- **Dynamic Scheduling Policy:** Uses virtual usage to unify different scheduling goals.
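The paper's live migration relies on the KV cache being append-only: blocks copied in an earlier stage never change, so each stage only ships the blocks produced since the last one, and the request pauses only for the final small delta. A minimal Python sketch of this staged-copy idea (all names hypothetical, abstracting GPU-to-GPU transfer into a `copy_block` callback) might look like:

```python
def live_migrate(src_kv, copy_block, threshold=2):
    """Copy src_kv (a growing list of KV blocks) to a destination in stages.

    `copy_block` ships one block; decoding may keep appending to src_kv
    between stages because the cache is append-only. Returns the number of
    blocks that had to be copied while the request was paused (the downtime).
    """
    copied = 0
    # Stage loop: copy everything produced so far without pausing decoding.
    while len(src_kv) - copied > threshold:
        snapshot = len(src_kv)
        for i in range(copied, snapshot):
            copy_block(src_kv[i])
        copied = snapshot
    # Final stage: pause decoding and copy only the small remaining delta.
    paused_delta = len(src_kv) - copied
    for i in range(copied, len(src_kv)):
        copy_block(src_kv[i])
    return paused_delta
```

The key property is that the paused delta is bounded by `threshold` blocks regardless of how long the request's KV cache has grown, which is what makes the downtime near-zero.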
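The "virtual usage" idea lets one dispatch rule serve several goals: each goal (reserving headroom for high-priority requests, draining an instance for auto-scaling, de-fragmentation) is expressed as extra virtual memory usage added to an instance's real usage, and the dispatcher simply balances on the resulting virtual load. A hypothetical sketch, with made-up field names, under the assumption that load is a single memory metric:

```python
def virtual_load(inst):
    """Real usage plus goal-specific virtual usage for one instance."""
    load = inst["real_usage"]
    load += inst.get("priority_headroom", 0)  # reserve room for urgent requests
    if inst.get("draining"):                  # make a draining instance look full
        load += float("inf")
    return load

def dispatch(instances, demand):
    """Pick the instance with the lowest virtual load that can fit the request."""
    candidates = [i for i in instances
                  if virtual_load(i) + demand <= i["capacity"]]
    return min(candidates, key=virtual_load, default=None)
```

Because every goal is folded into the same scalar, rescheduling decisions for load balancing, priorities, and scale-down all reduce to the one comparison above.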
**Evaluation:**
- **Migration Efficiency:** Near-zero downtime and negligible overhead.
- **Serving Performance:** Significant improvements in latency and cost savings compared to baselines.
**Conclusion:**
Llumnix addresses the challenges of LLM serving by dynamically rescheduling requests, improving both performance and efficiency.