8 May 2024 | Vikranth Srivatsa*, Zijian He*, Reyna Abhyankar, Dongming Li, Yiying Zhang
**Preble: Efficient Distributed Prompt Scheduling for LLM Serving**
This paper introduces Preble, a distributed LLM serving platform that optimizes for prompt sharing and computation reuse. Traditional LLM serving systems treat each request independently, missing opportunities to reuse computation across requests that share prompt prefixes. Preble addresses this with a distributed scheduling system that co-optimizes computation reuse and load balancing. Evaluated on five popular LLM workloads with real request arrival patterns on two open-source LLM models, Preble shows significant latency improvements over state-of-the-art systems.
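The summary does not spell out how shared-prompt reuse is detected, so here is a minimal, hypothetical sketch: a token-level prefix index in the spirit of the radix-tree caches used by modern serving engines. The class names and structure are illustrative assumptions, not Preble's actual implementation.

```python
# Hypothetical sketch: detecting reusable prompt prefixes with a trie.
# Token-level prefix matching approximates how much of a new request's
# KV cache could be reused from previously served requests.

class TrieNode:
    def __init__(self):
        self.children = {}

class PrefixIndex:
    """Tracks token prefixes of served prompts to estimate KV-cache reuse."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def shared_prefix_len(self, tokens):
        """Length of the longest already-served prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

index = PrefixIndex()
index.insert([101, 7, 7, 42, 9])                 # earlier request's prompt tokens
print(index.shared_prefix_len([101, 7, 7, 88]))  # -> 3 tokens reusable
```

The more tokens a new request shares with an indexed prefix, the less prefill computation a GPU holding that cache must redo.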
**Key Contributions:**
1. **Study of LLM Workloads:** The first comprehensive study of five real LLM workloads with long and shared prompts, plus a datacenter LLM request trace, yielding four key insights into prompt and request-load features.
2. **Challenges in Distributed LLM Serving:** Identification of three new challenges that distributed LLM serving faces under long and shared prompts.
3. **E2 Scheduling Algorithm:** E2, a novel LLM request scheduling algorithm that integrates exploitation and exploration to dynamically adapt request and state scheduling to GPU load and prompt-sharing features (a simplified sketch of this policy follows the list).
4. **Preble System Design:** Preble, the first distributed LLM serving system targeting long and shared prompts, built around a global scheduler and a per-GPU local scheduler.
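The exact E2 policy is not reproduced in this summary, but its exploit-or-explore structure can be illustrated with a short sketch. The load metric, the `reuse_threshold`, and the function names below are all assumptions for illustration, not Preble's actual algorithm: route a request to the GPU holding its longest cached prefix when enough of the prompt can be reused (exploitation), and to the least-loaded GPU otherwise (exploration).

```python
# Hypothetical sketch of an exploit-vs-explore routing decision in the
# spirit of E2. The load metric, threshold, and names are assumptions.

def schedule(prompt_len, cached_prefix_lens, gpu_loads, reuse_threshold=0.5):
    """Pick a GPU for one request.

    cached_prefix_lens[g]: tokens of this prompt already cached on GPU g.
    gpu_loads[g]: current load of GPU g (lower is better).
    """
    best_gpu = max(cached_prefix_lens, key=cached_prefix_lens.get)
    reuse_ratio = cached_prefix_lens[best_gpu] / max(prompt_len, 1)
    if reuse_ratio >= reuse_threshold:
        return best_gpu                       # exploit: reuse the cached prefix
    return min(gpu_loads, key=gpu_loads.get)  # explore: balance load instead

# 80 of 100 prompt tokens are cached on GPU 1, so the request is routed there.
print(schedule(100, {0: 5, 1: 80}, {0: 0.2, 1: 0.9}))  # -> 1
```

A production scheduler would also account for cache eviction and queueing delay; the threshold form here only conveys the reuse-versus-load trade-off.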
**Evaluation:**
- Preble outperforms state-of-the-art systems by 1.5× to 14.5× in average latency and 2× to 10× in p99 latency.
- Experiments use two open-source LLM models (Mistral 7B and Llama-3 70B) with real workloads and request arrival patterns on two GPU clusters (four NVIDIA A6000s and eight NVIDIA H100s).
**Future Work:**
- Open-sourcing Preble upon acceptance.
- Further research on improving the scalability and efficiency of the global scheduler.