Preble: Efficient Distributed Prompt Scheduling for LLM Serving


8 May 2024 | Vikranth Srivatsa*, Zijian He*, Reyna Abhyankar, Dongming Li, Yijing Zhang
Preble is a distributed LLM serving platform that optimizes prompt sharing, targeting workloads with long, shared prompt prefixes. The paper introduces E2, a scheduling algorithm that dynamically adapts request and state scheduling to GPU load and prompt-sharing patterns: it exploits cached prompt prefixes to reuse computation and explores new GPUs when load balancing demands it. The system pairs a global scheduler with a per-GPU local scheduler, supports autoscaling, and improves memory efficiency and fairness over SGLang. Evaluated on two open-source LLMs and two GPU clusters, Preble outperforms existing systems, improving average latency by 1.5× to 14.5× and p99 latency by 2× to 10×.
Key insights include the importance of prompt sharing, the need for efficient scheduling, and the challenge of balancing prefill and decoding computation. Preble's design addresses these challenges through a combination of scheduling strategies and efficient resource management.
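To make the exploitation-exploration trade-off concrete, here is a minimal Python sketch of a scheduling decision in the spirit of E2. All names, the prefix-match heuristic, and the threshold are illustrative assumptions, not Preble's actual implementation (which tracks cached prefixes in per-GPU radix trees and uses a more detailed load and cost model):

```python
# Hedged sketch: pick a GPU for an incoming prompt by either exploiting a
# GPU that already caches a large share of the prompt's prefix, or
# exploring (load-balancing) onto the least-loaded GPU.
# This is a simplification of the exploitation-exploration idea described
# in the summary above, not Preble's real scheduler.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two sequences (characters stand in
    for tokens here, for simplicity)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def schedule(prompt: str,
             gpu_caches: dict[str, list[str]],
             gpu_loads: dict[str, int],
             threshold: float = 0.5) -> str:
    """Return the chosen GPU id. `gpu_caches` maps GPU -> cached prompts;
    `gpu_loads` maps GPU -> current load. `threshold` (an assumed knob)
    is the minimum fraction of the prompt that must already be cached
    for reuse to win over load balancing."""
    best_gpu, best_match = None, 0
    for gpu, cached_prompts in gpu_caches.items():
        match = max((shared_prefix_len(prompt, c) for c in cached_prompts),
                    default=0)
        if match > best_match:
            best_gpu, best_match = gpu, match
    # Exploit: a large fraction of the prompt is already cached on best_gpu,
    # so routing there reuses prefill computation.
    if best_gpu is not None and best_match / len(prompt) >= threshold:
        return best_gpu
    # Explore: no GPU caches enough of the prefix; balance load instead.
    return min(gpu_loads, key=gpu_loads.get)
```

For example, a request sharing a long system-prompt prefix with GPU `g0`'s cache would be routed to `g0` even if it is busier, while an unrelated prompt would fall through to the least-loaded GPU. A real scheduler would also weigh prefill versus decode cost when comparing the two options, as the summary notes.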