FASTDECODE is a high-throughput, GPU-efficient LLM serving system built on heterogeneous CPU-GPU pipelines. Large language models (LLMs) are expensive to serve because GPUs, despite their cost, are poorly utilized by sequential token generation. Larger batches improve GPU utilization, but the batch size is capped by the memory footprint of the KV-cache, which stores the keys and values of previously processed tokens. Simply offloading the KV-cache to host memory does not help, because shuttling it back to the GPU at every decoding step is bottlenecked by the limited CPU-GPU bandwidth.
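To make the memory constraint concrete, the sketch below estimates the KV-cache footprint for a hypothetical 7B-parameter-class model (32 layers, 32 heads of dimension 128, fp16). The dimensions are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2,
                   seq_len=2048, batch_size=1):
    """Bytes of KV-cache: two tensors (K and V) per layer, per token."""
    per_token = 2 * n_layers * n_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch_size

# Roughly 0.5 MiB per token, so ~1 GiB for a 2048-token sequence; a batch of
# 64 such sequences would need ~64 GiB for the KV-cache alone.
print(kv_cache_bytes() / 2**30, "GiB per sequence")
print(kv_cache_bytes(batch_size=64) / 2**30, "GiB for a batch of 64")
```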
The key insight of FASTDECODE is to decompose the transformer into two parts: the R-Part, the memory-bound attention computation that operates on the KV-cache, and the S-Part, the compute-heavy remainder of the model. The R-Part runs on CPUs, next to the memory that holds the KV-cache, while the S-Part stays on GPUs. Because only small per-token activations cross the CPU-GPU link instead of the KV-cache itself, data transmission overhead drops and GPU throughput rises. FASTDECODE also tackles the resulting efficiency challenges with scheduling and performance-modeling techniques that balance the workload between CPUs and GPUs.
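As a rough illustration of the split, the following NumPy sketch runs a single-head decode step: the projections and output transform stand in for the S-Part, and a worker object holding the KV-cache stands in for the R-Part. The names (RPartWorker, s_part, HEAD_DIM) and the single-head structure are illustrative assumptions, not FASTDECODE's actual implementation.

```python
import numpy as np

HEAD_DIM = 64  # illustrative size, not a configuration from the paper

class RPartWorker:
    """R-Part: holds the KV-cache and computes attention (CPU-side)."""
    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def attend(self, q, k_new, v_new):
        # Append the new token's K/V, then attend over the whole cache.
        self.k_cache.append(k_new)
        self.v_cache.append(v_new)
        K = np.stack(self.k_cache)           # (seq_len, head_dim)
        V = np.stack(self.v_cache)           # (seq_len, head_dim)
        scores = K @ q / np.sqrt(HEAD_DIM)   # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                   # (head_dim,)

def s_part(x, Wq, Wk, Wv, Wo, worker):
    """S-Part: compute-heavy projections and output transform (GPU-side)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv   # only small activations cross the link,
    attn_out = worker.attend(q, k, v)  # the KV-cache never leaves the worker
    return attn_out @ Wo

rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.standard_normal((HEAD_DIM, HEAD_DIM)) * 0.1 for _ in range(4))
worker = RPartWorker()
x = rng.standard_normal(HEAD_DIM)
for _ in range(4):
    # Feed the output back in as a stand-in for the next token's hidden state.
    x = s_part(x, Wq, Wk, Wv, Wo, worker)
```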
Evaluation results show that FASTDECODE achieves 1.88×-5.04× higher throughput than vLLM when serving modern LLMs on the same GPU. The system aggregates the memory capacity and bandwidth of multiple out-of-chassis remote CPUs to hold the KV-cache and run the associated computation. A sequence-level load-stabilizing schedule keeps the per-step CPU workload even, and a model-guided approach orchestrates the GPU with the CPUs to optimize overall performance.
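The scheduling idea can be sketched as follows: because the CPU-side attention cost per step grows with the total number of cached tokens across the batch, mixing sequences at different decoding stages and admitting new ones as old ones finish keeps that total near a target. The class below is a simplified illustration of this idea under assumed interfaces, not the paper's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Seq:
    cached: int     # tokens already in the KV-cache
    remaining: int  # tokens still to generate

class LoadStabilizingScheduler:
    """Keeps the aggregate KV-cache length (the per-step CPU attention load)
    close to a fixed target across decoding steps."""
    def __init__(self, target_tokens: int):
        self.target = target_tokens
        self.running: list[Seq] = []
        self.waiting: list[Seq] = []

    def submit(self, seq: Seq) -> None:
        self.waiting.append(seq)

    def step(self) -> int:
        load = sum(s.cached for s in self.running)
        # Admit new sequences while the aggregate cache length is below the
        # target, so freshly started (short) sequences fill the gap left by
        # long-running ones.
        while self.waiting and load + self.waiting[0].cached <= self.target:
            seq = self.waiting.pop(0)
            self.running.append(seq)
            load += seq.cached
        # One decode step: every running sequence adds one token to its cache.
        for s in self.running:
            s.cached += 1
            s.remaining -= 1
        # Finished sequences leave, freeing room for waiting ones next step.
        self.running = [s for s in self.running if s.remaining > 0]
        return load

sched = LoadStabilizingScheduler(target_tokens=4096)
for _ in range(8):
    sched.submit(Seq(cached=512, remaining=256))
loads = [sched.step() for _ in range(300)]
```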
In summary, FASTDECODE is a CPU-GPU heterogeneous pipeline for LLM inference that addresses the challenges of high-throughput serving. Its sequence-level load-stabilizing schedule and model-guided orchestration of the GPU with CPUs keep both kinds of hardware busy. Evaluated on a range of models, the system shows significant improvements in throughput and latency over existing systems, and it scales to large batch sizes, making it an efficient solution for serving LLMs.