The paper "Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction" addresses the challenge of efficiently serving inference requests from large language models (LLMs) due to their unpredictable execution times. Traditional first-come-first-serve (FCFS) scheduling in LLM serving systems suffers from head-of-line blocking issues, leading to long job completion times (JCT) and low throughput. To mitigate these issues, the authors propose a speculative shortest-job-first (SSJF) scheduler that uses a lightweight proxy model to predict the output sequence lengths of LLMs. This approach does not require changes to memory management or batching strategies and can be applied to various batching settings, including no batching, dynamic batching, and continuous batching.
The key contributions of the paper include:
1. **SSJF Scheduler**: A speculative shortest-job-first scheduler that leverages a proxy model to predict output sequence lengths, reducing average JCT by 30.5–39.6% and increasing throughput by 2.2–3.6× compared to FCFS.
2. **Open-source Implementation**: An open-source implementation of SSJF, evaluated on real-world datasets and production workload traces.
3. **Proxy Model Architecture**: A BERT-base model fine-tuned to predict the output token length of each request with high accuracy (a sketch of such a predictor follows this list).
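Conceptually, such a proxy predictor can be obtained by fine-tuning a small regressor over prompts. The sketch below shows one plausible setup using Hugging Face `transformers`; the model name, regression head, and toy targets are illustrative assumptions, and the paper's actual training recipe (e.g., regression vs. bucketed length classification, datasets, loss) may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# BERT-base with a single-output regression head: num_labels=1 plus
# problem_type="regression" trains the scalar output with MSE loss.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

prompts = ["What is 2 + 2?", "Write a detailed essay on transformers."]
target_lens = torch.tensor([[4.0], [512.0]])  # toy observed output token counts

inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=target_lens)  # forward pass with MSE loss

outputs.loss.backward()  # one fine-tuning step (optimizer omitted for brevity)
print(outputs.logits.squeeze(-1))  # predicted output lengths (untrained here)
```

At serving time, only a forward pass of this small model is needed per request, which is far cheaper than the LLM generation it helps schedule.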
Evaluations on real-world datasets and production workload traces show that SSJF significantly reduces JCT and improves throughput across different batching settings. The paper also discusses potential use cases for proxy models in LLM serving, such as memory allocation, caching, and server resource allocation.