TetriInfer is a cloud-scale LLM inference serving system designed to minimize interference between different types of inference requests. LLM inference consists of a prefill phase and a decode phase, but existing deployment practices often overlook their distinct characteristics, leading to significant interference. TetriInfer addresses this by carefully scheduling and grouping requests based on their characteristics through three pillars: (1) partitioning prompts into fixed-size chunks to keep the accelerator near its computation-saturated limit; (2) disaggregating the prefill and decode phases so they can run independently; and (3) using a smart two-level scheduling algorithm with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer significantly improves time-to-first-token (TTFT), job completion time (JCT), and performance per dollar: it uses 38% fewer resources while lowering average TTFT and JCT by 97% and 47%, respectively.
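To make the chunked-prefill pillar concrete, here is a minimal Python sketch of splitting a prompt into fixed-size chunks and running prefill one chunk at a time. The chunk size of 512 and the `chunk_prompt`/`chunked_prefill` names are illustrative assumptions, not TetriInfer's actual API.

```python
from typing import List

CHUNK_SIZE = 512  # hypothetical value; chosen so one chunk keeps the accelerator compute-bound

def chunk_prompt(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a prompt into fixed-size chunks for chunked prefill."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def chunked_prefill(token_ids: List[int], forward_step) -> None:
    """Run prefill chunk by chunk; `forward_step` stands in for a model forward
    pass that appends the chunk's keys/values to the KV cache."""
    for chunk in chunk_prompt(token_ids):
        forward_step(chunk)

# Usage with a dummy forward step:
if __name__ == "__main__":
    prompt = list(range(1300))                   # a 1300-token prompt
    chunks_seen = []
    chunked_prefill(prompt, chunks_seen.append)  # records chunks of 512, 512, and 276 tokens
    print([len(c) for c in chunks_seen])
```

Because every chunk has (roughly) the same size, each prefill step presents the accelerator with a predictable, near-saturating amount of work regardless of how long individual prompts are.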
LLM inference powers various downstream tasks, such as document summarization and content creation, which have very different properties. These tasks can be categorized along two dimensions: prefill prompt length and decode token length. Mixing inference requests of different types leads to significant interference: in the paper's motivating experiments, mixing prefill requests can cause up to a 10x slowdown, while mixing decode requests can cost 16% of throughput. A naive way to avoid this interference is to statically provision resources for each task type, but this is impractical given the high cost of LLM serving infrastructure. TetriInfer instead disaggregates the prefill and decode phases into separate instances so they can run independently, which reduces interference and improves performance.
TetriInfer's design includes a centralized control plane, prefill and decode instances, and a length prediction model. The control plane manages the lifecycle of instances and schedules requests based on load. Prefill instances run the prefill phase and partition prompts into fixed-size chunks to keep the accelerator near its computation-saturated limit. Decode instances run the decode phase and use a smart two-level scheduling algorithm to avoid scheduling hotspots. The length prediction model predicts the number of generated tokens for decode requests, allowing them to be scheduled accordingly.
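As a rough illustration of how predicted decode lengths can drive dispatch decisions, the sketch below sends each decode request to the instance with the smallest sum of predicted remaining tokens, so long-running requests do not pile up on one hotspot. The class names and the least-predicted-load heuristic are assumptions for illustration, not the paper's exact two-level algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DecodeInstance:
    name: str
    predicted_load: int = 0                                  # sum of predicted remaining decode tokens
    queue: List[Tuple[str, int]] = field(default_factory=list)

def dispatch(request_id: str, predicted_tokens: int,
             instances: List[DecodeInstance]) -> DecodeInstance:
    """Route a decode request to the instance with the least predicted load."""
    target = min(instances, key=lambda inst: inst.predicted_load)
    target.predicted_load += predicted_tokens
    target.queue.append((request_id, predicted_tokens))
    return target

# Usage with hypothetical length-predictor outputs:
instances = [DecodeInstance("decode-0"), DecodeInstance("decode-1")]
for rid, pred in [("r1", 900), ("r2", 40), ("r3", 850)]:
    chosen = dispatch(rid, pred, instances)
    print(rid, "->", chosen.name)
# r1 -> decode-0, r2 -> decode-1, r3 -> decode-1 (loads stay balanced: 900 vs 890)
```

The key point is that scheduling on predicted token counts, rather than on request counts alone, is what lets the scheduler avoid hotspots caused by a few very long decodes.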
TetriInfer is implemented on top of vLLM, with most modules written in Python; the network stack is written in C++ to interface with low-level APIs. The system is evaluated on a real testbed with emulated network bandwidth ranging from 200 Gbps to 300 GBps. Results show that TetriInfer improves performance per dollar by 2.4x for light-prefill, heavy-decode workloads and by 1.9x for mixed workloads. However, the design is not ideal for heavy-prefill, heavy-decode workloads, where the room for improvement is marginal and the overhead it introduces cannot be offset. Overall, TetriInfer shows that disaggregating prefill and decode and scheduling requests by their predicted resource usage yields substantial gains in TTFT, JCT, and performance per dollar across most workload mixes.
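The paper's exact performance-per-dollar formulation is not reproduced here, but the sketch below shows one plausible way such a metric can be computed (requests served per dollar of accelerator time) and how a 38% resource reduction alone already translates into a perf-per-dollar gain. All numbers in the example are made up for illustration.

```python
def perf_per_dollar(requests_served: int, accel_seconds: float,
                    cost_per_accel_second: float) -> float:
    """One plausible performance-per-dollar definition: requests served per
    dollar of accelerator time (the paper's exact formulation may differ)."""
    return requests_served / (accel_seconds * cost_per_accel_second)

# Illustrative, made-up numbers: serving the same 1000 requests with 38% fewer
# accelerator-seconds raises perf/$ by ~1.6x, before any latency gains are counted.
baseline = perf_per_dollar(1000, accel_seconds=3600,        cost_per_accel_second=0.001)
tetri    = perf_per_dollar(1000, accel_seconds=3600 * 0.62, cost_per_accel_second=0.001)
print(round(tetri / baseline, 2))  # 1.61
```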