Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction


12 Apr 2024 | Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, Ravishankar Iyer
This paper presents a speculative shortest-job-first (SSJF) scheduler for efficient interactive large language model (LLM) serving. SSJF uses a lightweight proxy model to predict the output sequence length of LLM requests, enabling more efficient scheduling. Unlike traditional first-come-first-serve (FCFS) scheduling, which suffers from head-of-line blocking, SSJF prioritizes shorter jobs, reducing average job completion time (JCT) and increasing throughput. Evaluations on real-world datasets and production workload traces show that SSJF reduces JCT by 30.5–39.6% and increases throughput by 2.2–3.6× compared to FCFS schedulers across no-batching, dynamic-batching, and continuous-batching settings. The SSJF scheduler is implemented as an open-source system and supports various LLM inference scenarios, including multi-round conversations. The proxy model used for sequence length prediction is a fine-tuned BERT-base model, which achieves high prediction accuracy and scheduling performance. The paper also discusses potential applications of proxy models in LLM serving beyond request scheduling. The results demonstrate that SSJF significantly improves the efficiency of LLM serving while maintaining low overhead.
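To make the core idea concrete, here is a minimal Python sketch of SSJF's two ingredients: a proxy model that speculates each request's output length from the prompt alone, and a priority queue that serves the shortest predicted job first instead of FCFS. This is an illustration under assumptions, not the paper's released implementation; the checkpoint name (`bert-base-uncased`), the single-logit regression head, and the helper names (`predict_output_length`, `SSJFQueue`) are all hypothetical. The paper fine-tunes BERT-base as its proxy, which this sketch approximates with an off-the-shelf Hugging Face model.

```python
import heapq
import itertools

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical proxy predictor: BERT-base with a single-output regression head
# standing in for the paper's fine-tuned sequence-length predictor.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
proxy = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # one scalar: predicted output tokens
)
proxy.eval()


def predict_output_length(prompt: str) -> float:
    """Speculate the LLM's output length from the prompt, before generation."""
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return proxy(**inputs).logits.item()


class SSJFQueue:
    """Shortest-job-first request queue keyed on the speculative length."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break for equal predictions

    def submit(self, prompt: str) -> None:
        pred = predict_output_length(prompt)
        heapq.heappush(self._heap, (pred, next(self._counter), prompt))

    def next_request(self):
        # Pop the request with the shortest predicted output, avoiding the
        # head-of-line blocking that an FCFS queue incurs behind long requests.
        pred, _, prompt = heapq.heappop(self._heap)
        return prompt, pred
```

Because the prediction only determines queue order, a misprediction degrades scheduling quality but never correctness: every request is still served, and the proxy's inference cost (a single BERT-base forward pass) is negligible next to LLM generation, which is what keeps the scheduler's overhead low.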