Preble: Efficient Distributed Prompt Scheduling for LLM Serving


8 May 2024 | Vikranth Srivatsa*, Zijian He*, Reyna Abhyankar, Dongming Li, Yiying Zhang
This paper introduces Preble, a distributed LLM serving platform designed to optimize prompt sharing and computation reuse. Traditional LLM serving systems treat each request independently, missing opportunities to reuse computation across requests that share long prompt prefixes (illustrated in the first sketch below). Preble addresses this with a distributed scheduling system that co-optimizes computation reuse and load balancing. Evaluated on five popular LLM workloads with real request arrival patterns on two open-source models, Preble shows significant latency improvements over state-of-the-art systems.

**Key Contributions:**

1. **Study of LLM Workloads:** The first comprehensive study of five real LLM workloads and a datacenter LLM request trace with long and shared prompts, characterizing their prompt and request-load features and yielding four key insights.
2. **Challenges in Distributed LLM Serving:** Identification of three new challenges in distributed LLM serving under long and shared prompts.
3. **E2 Scheduling Algorithm:** E2, a novel LLM request scheduling algorithm that integrates exploitation and exploration to dynamically adapt request and state scheduling to GPU load and prompt-sharing features (a simplified version appears in the second sketch below).
4. **Preble System Design:** Preble, the first distributed LLM serving system targeting long and shared prompts, built around a global scheduler and a per-GPU local scheduler.
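The value of prefix sharing can be seen with a back-of-the-envelope calculation. Below is a minimal sketch, not Preble's code: the token counts are illustrative assumptions, and real systems reuse cached KV state rather than comparing token lists.

```python
# Two requests that share a long system prompt: with prefix caching on the
# same GPU, the shared prefix's KV state is computed once and reused.
# Token counts here are made up for illustration.

shared_prefix = ["<system-token>"] * 3000          # long shared prompt prefix
request_a = shared_prefix + ["<question-a>"] * 100
request_b = shared_prefix + ["<question-b>"] * 100

def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Without reuse: both prompts pay full prefill cost.
no_reuse = len(request_a) + len(request_b)               # 6200 tokens

# With reuse: the second request only prefills its unique suffix.
shared = common_prefix_len(request_a, request_b)
with_reuse = len(request_a) + (len(request_b) - shared)  # 3200 tokens

print(f"prefill without reuse: {no_reuse}, with reuse: {with_reuse}")
```

That is roughly a 2× reduction in prefill work, but it only materializes if both requests land on the GPU holding the cached prefix, which is why reuse must be co-optimized with load balancing.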
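The second sketch illustrates the exploitation/exploration idea behind E2. The cost model, the 50% match threshold, and all names (`GpuState`, `route_request`, etc.) are assumptions made for illustration; the paper's actual algorithm and data structures differ.

```python
from dataclasses import dataclass, field

@dataclass
class GpuState:
    gpu_id: int
    load: float = 0.0    # outstanding prefill/decode work, in token units
    cache: set[str] = field(default_factory=set)  # cached prompt chunks

def matched_chunks(gpu: GpuState, chunks: list[str]) -> int:
    """How many leading prompt chunks this GPU already has cached."""
    n = 0
    for c in chunks:
        if c not in gpu.cache:
            break
        n += 1
    return n

def route_request(chunks: list[str], gpus: list[GpuState],
                  exploit_threshold: float = 0.5) -> GpuState:
    """Exploit cached state when the prefix match is large; otherwise
    explore by load balancing (illustrative policy, not Preble's)."""
    best_match = max(gpus, key=lambda g: matched_chunks(g, chunks))
    match_ratio = matched_chunks(best_match, chunks) / len(chunks)

    if match_ratio >= exploit_threshold:
        # Exploitation: most of the prompt's KV state already lives on
        # this GPU, so recomputation savings outweigh load imbalance.
        target = best_match
    else:
        # Exploration: little to reuse anywhere; send the request to the
        # least-loaded GPU, which then builds cache state for this prefix.
        target = min(gpus, key=lambda g: g.load)

    # Charge the GPU only for the uncached part of the prompt, and record
    # the prefix state this request will leave behind.
    target.load += len(chunks) - matched_chunks(target, chunks)
    target.cache.update(chunks)
    return target
```

In Preble, a decision like this is split between the global scheduler and the per-GPU local schedulers, which track up-to-date cache and load state; the sketch collapses both levels into a single function.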
**Evaluation:**

- Preble outperforms state-of-the-art serving systems by 1.5× to 14.5× in average latency and 2× to 10× in p99 latency.
- Experiments use two open-source models (Mistral 7B and Llama-3 70B) with real workloads and request arrival patterns on two GPU clusters: four NVIDIA A6000s and eight NVIDIA H100s.

**Future Work:**

- Open-sourcing Preble upon acceptance.
- Improving the scalability and efficiency of the global scheduler.