ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference

April 27-May 1, 2024 | Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo
ExeGPT is a distributed system designed for constraint-aware LLM inference. It maximizes inference throughput while satisfying latency constraints by efficiently allocating resources and determining optimal execution configurations, including batch sizes and partial tensor parallelism. Two novel scheduling strategies, Round-Robin Allocation (RRA) and Workload-Aware Allocation (WAA), are introduced to handle different NLP workloads; together they expose four control variables, enabling flexible trade-offs between throughput and latency.

The system leverages sequence length distributions to determine optimal schedules and uses a branch-and-bound method for efficient scheduling, adapting to changing sequence distributions with minimal rescheduling cost. Evaluated on six LLM instances and five NLP tasks, ExeGPT achieves up to 15.2× higher throughput and 6× lower latency than FasterTransformer, with an average throughput gain of 2.9× across twenty evaluation scenarios. These results make ExeGPT effective for optimizing and executing LLM inference across diverse NLP workloads and serving conditions.
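To make the scheduling idea concrete, the following is a minimal sketch of a constraint-aware search in the spirit of the branch-and-bound method described above. The cost model, pruning rule, and all names here are simplified assumptions for exposition, not ExeGPT's actual scheduler, which additionally models sequence-length distributions and the interaction of its four control variables.

```python
# Toy branch-and-bound over (batch size, tensor-parallel degree):
# maximize per-GPU throughput subject to a latency SLO. All numbers and
# functions below are illustrative assumptions, not measured costs.

def estimate_latency(batch: int, tp: int) -> float:
    """Toy per-step decode latency (seconds): a fixed kernel cost plus a
    per-sequence cost, sped up by the tensor-parallel degree (assumption)."""
    return (0.08 + 0.01 * batch) / tp

def best_schedule(max_batch, tp_options, latency_slo):
    """Enumerate (batch, tp) pairs, pruning any branch whose latency
    already violates the SLO; return the feasible configuration with the
    highest per-GPU throughput as (throughput, batch, tp)."""
    best = None
    for tp in tp_options:
        for batch in range(1, max_batch + 1):
            lat = estimate_latency(batch, tp)
            if lat > latency_slo:
                break  # latency is monotone in batch size, so prune the rest
            throughput = batch / (lat * tp)  # sequences/sec per GPU (toy)
            if best is None or throughput > best[0]:
                best = (throughput, batch, tp)
    return best
```

For example, `best_schedule(64, [1, 2, 4], 0.2)` selects the largest tensor-parallel degree in this toy model, because it is what admits the largest batch under the 200 ms SLO; the real scheduler makes an analogous trade-off under a far richer cost model.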