April 27–May 1, 2024 | Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo
ExeGPT is a distributed system designed for constraint-aware large language model (LLM) inference, aiming to maximize inference throughput while satisfying a given latency constraint. The system leverages the distribution of input and output sequence lengths to efficiently allocate resources and determine optimal execution configurations, including batch sizes and partial tensor parallelism. Two scheduling strategies, Round-Robin Allocation (RRA) and Workload-Aware Allocation (WAA), are introduced, each with four control variables to balance throughput and latency. ExeGPT is evaluated on six LLM instances (T5, OPT, and GPT-3) and five NLP tasks, achieving up to 15.2× higher throughput and 6× lower latency than FasterTransformer. The system demonstrates effective optimization and execution of LLM inference for diverse NLP workloads and serving conditions.
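
To make the abstract's core idea concrete, below is a minimal sketch (not ExeGPT's actual scheduler) of a constraint-aware configuration search: sample output lengths from an assumed distribution, estimate latency with a placeholder cost model, and grid-search batch size and tensor-parallel degree for the highest-throughput configuration that still meets a latency bound. All function names, constants, and the cost model itself are hypothetical stand-ins, not taken from the paper.

```python
import itertools
import random

# Hypothetical cost model: per-request latency (seconds) for a given batch
# size and tensor-parallel (TP) degree. A real system would profile or
# measure this; the constants below are placeholders for illustration.
def estimate_latency(batch_size: int, tp_degree: int, avg_out_len: float) -> float:
    # Per-decode-step cost: compute shrinks with TP, communication grows with TP.
    decode_step = 0.002 * batch_size / tp_degree + 0.001 * tp_degree
    return decode_step * avg_out_len

def estimate_throughput(batch_size: int, latency: float) -> float:
    return batch_size / latency  # sequences completed per second

# Sampled output lengths standing in for an observed length distribution,
# mirroring the abstract's use of sequence-length statistics.
output_lengths = [random.gauss(200, 50) for _ in range(10_000)]
avg_out_len = sum(output_lengths) / len(output_lengths)

LATENCY_SLO = 1.0  # seconds; an assumed latency constraint

# Exhaustive search over (batch size, TP degree); keep the best feasible config.
best = None
for batch_size, tp_degree in itertools.product([1, 2, 4, 8, 16, 32], [1, 2, 4, 8]):
    lat = estimate_latency(batch_size, tp_degree, avg_out_len)
    if lat > LATENCY_SLO:
        continue  # violates the latency constraint
    tput = estimate_throughput(batch_size, lat)
    if best is None or tput > best[0]:
        best = (tput, batch_size, tp_degree, lat)

if best:
    tput, bs, tp, lat = best
    print(f"batch={bs}, tp={tp}: {tput:.1f} seq/s at {lat:.3f}s latency")
```

The sketch captures only the optimization objective (maximize throughput subject to a latency constraint); ExeGPT's RRA and WAA strategies and its four control variables operate at a finer grain than this simple grid search.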