TetriInfer is a cloud-scale LLM inference serving system designed to minimize interference between different types of inference requests. LLM inference consists of a prefill phase and a decode phase, but existing deployment practices often overlook their distinct characteristics, leading to significant interference. TetriInfer addresses this by carefully scheduling and grouping requests based on their characteristics through three pillars: (1) partitioning prompts into fixed-size chunks to keep the accelerator near its computation-saturated limit; (2) disaggregating the prefill and decode phases so they can run independently; and (3) using a smart two-level scheduling algorithm with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer significantly improves time-to-first-token (TTFT), job completion time (JCT), and performance per dollar: it uses 38% fewer resources while lowering average TTFT and JCT by 97% and 47%, respectively.
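To make the chunked-prefill pillar concrete, here is a minimal Python sketch of splitting a prompt into fixed-size chunks and running prefill one chunk at a time. The chunk size of 512 and the `chunk_prompt`/`chunked_prefill` names are illustrative assumptions, not TetriInfer's actual API.

```python
from typing import List

CHUNK_SIZE = 512  # hypothetical value; chosen so one chunk keeps the accelerator compute-bound

def chunk_prompt(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a prompt into fixed-size chunks for chunked prefill."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def chunked_prefill(token_ids: List[int], forward_step) -> None:
    """Run prefill chunk by chunk; `forward_step` stands in for a model forward
    pass that appends the chunk's keys/values to the KV cache."""
    for chunk in chunk_prompt(token_ids):
        forward_step(chunk)

# Usage with a dummy forward step:
if __name__ == "__main__":
    prompt = list(range(1300))                   # a 1300-token prompt
    chunks_seen = []
    chunked_prefill(prompt, chunks_seen.append)  # records chunks of 512, 512, and 276 tokens
    print([len(c) for c in chunks_seen])
```

Because every chunk has (roughly) the same size, each prefill step presents the accelerator with a predictable, near-saturating amount of work regardless of how long individual prompts are.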
LLM inference powers various downstream tasks, such as document summarization and content creation, which have very different properties. These tasks can be categorized along two dimensions: prefill prompt length and decode token length. Mixing inference requests of different types leads to significant interference: in the paper's motivating experiments, mixing prefill requests can cause up to a 10x slowdown, while mixing decode requests can cost 16% of throughput. A naive way to avoid this interference is to statically provision resources for each task type, but this is impractical given the high cost of LLM serving infrastructure. TetriInfer instead disaggregates the prefill and decode phases into separate instances so they can run independently, which reduces interference and improves performance.
TetriInfer's design includes a centralized control plane, prefill and decode instances, and a length prediction model. The control plane manages the lifecycle of instances and schedules requests based on load. Prefill instances run the prefill phase and partition prompts into fixed-size chunks to keep the accelerator near its computation-saturated limit. Decode instances run the decode phase and use a smart two-level scheduling algorithm to avoid scheduling hotspots. The length prediction model predicts the number of generated tokens for decode requests, allowing them to be scheduled accordingly.
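As a rough illustration of how predicted decode lengths can drive dispatch decisions, the sketch below sends each decode request to the instance with the smallest sum of predicted remaining tokens, so long-running requests do not pile up on one hotspot. The class names and the least-predicted-load heuristic are assumptions for illustration, not the paper's exact two-level algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DecodeInstance:
    name: str
    predicted_load: int = 0                                  # sum of predicted remaining decode tokens
    queue: List[Tuple[str, int]] = field(default_factory=list)

def dispatch(request_id: str, predicted_tokens: int,
             instances: List[DecodeInstance]) -> DecodeInstance:
    """Route a decode request to the instance with the least predicted load."""
    target = min(instances, key=lambda inst: inst.predicted_load)
    target.predicted_load += predicted_tokens
    target.queue.append((request_id, predicted_tokens))
    return target

# Usage with hypothetical length-predictor outputs:
instances = [DecodeInstance("decode-0"), DecodeInstance("decode-1")]
for rid, pred in [("r1", 900), ("r2", 40), ("r3", 850)]:
    chosen = dispatch(rid, pred, instances)
    print(rid, "->", chosen.name)
# r1 -> decode-0, r2 -> decode-1, r3 -> decode-1 (loads stay balanced: 900 vs 890)
```

The key point is that scheduling on predicted token counts, rather than on request counts alone, is what lets the scheduler avoid hotspots caused by a few very long decodes.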
TetriInfer is implemented on top of vLLM, with most modules written in Python; the network stack is written in C++ to interface with low-level APIs. The system is evaluated on a real testbed with emulated network bandwidth ranging from 200 Gbps to 300 GBps. Results show that TetriInfer improves performance per dollar by 2.4x for light-prefill, heavy-decode workloads and by 1.9x for mixed workloads. However, the design is not ideal for heavy-prefill, heavy-decode workloads, where the room for improvement is marginal and the overhead it introduces cannot be offset. Overall, TetriInfer shows that disaggregating prefill and decode and scheduling requests by their predicted resource usage yields substantial gains in TTFT, JCT, and performance per dollar across most workload mixes.
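The paper's exact performance-per-dollar formulation is not reproduced here, but the sketch below shows one plausible way such a metric can be computed (requests served per dollar of accelerator time) and how a 38% resource reduction alone already translates into a perf-per-dollar gain. All numbers in the example are made up for illustration.

```python
def perf_per_dollar(requests_served: int, accel_seconds: float,
                    cost_per_accel_second: float) -> float:
    """One plausible performance-per-dollar definition: requests served per
    dollar of accelerator time (the paper's exact formulation may differ)."""
    return requests_served / (accel_seconds * cost_per_accel_second)

# Illustrative, made-up numbers: serving the same 1000 requests with 38% fewer
# accelerator-seconds raises perf/$ by ~1.6x, before any latency gains are counted.
baseline = perf_per_dollar(1000, accel_seconds=3600,        cost_per_accel_second=0.001)
tetri    = perf_per_dollar(1000, accel_seconds=3600 * 0.62, cost_per_accel_second=0.001)
print(round(tetri / baseline, 2))  # 1.61
```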