FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

18 Mar 2024 | Jiaao He, Jidong Zhai
FASTDECODE is a high-throughput, GPU-efficient system for serving large language models (LLMs) using heterogeneous pipelines. The primary challenge in serving LLMs is the high cost and limited availability of GPUs, which are used inefficiently when generating tokens sequentially. The batch size is limited by the size of the KV-Cache, a memory-bound intermediate result that occupies significant GPU memory. To address this, FASTDECODE decomposes the transformer model into two parts: R-Part, which includes the KV-Cache, and S-Part, which does not. The key insight is to process the KV-Cache near the CPU, leveraging the aggregated memory capacity and bandwidth of CPUs across multiple nodes. This approach reduces data-transmission overhead and boosts GPU throughput. The system also addresses heterogeneity challenges at both the temporal and inter-device levels using scheduling and performance-modeling techniques. The evaluation shows that FASTDECODE achieves 1.88x-5.04x the throughput of vLLM when serving modern LLMs on the same GPU, demonstrating significant improvements in token-generation throughput and latency. The system's performance is validated through experiments on various LLMs, including Llama and OPT, with different model sizes and hardware configurations.
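To make the R-Part/S-Part split concrete, below is a minimal sketch in plain PyTorch, assuming a simplified single-head attention layer. The function names (s_part_project, r_part_attend) and the device placement comments are illustrative assumptions, not the paper's actual API; FASTDECODE additionally pipelines micro-batches so the GPU and CPU work concurrently, which this sequential sketch omits.

```python
# Illustrative sketch of FASTDECODE's model split (names are assumptions,
# not the paper's API). S-Part holds no KV-Cache; R-Part owns the cache.
import torch

def s_part_project(x, wq, wk, wv):
    """S-Part (GPU): compute-bound projections that need no KV-Cache."""
    return x @ wq, x @ wk, x @ wv

def r_part_attend(q, k_new, v_new, kv_cache):
    """R-Part (near-CPU): memory-bound attention over the growing KV-Cache."""
    k_cache, v_cache = kv_cache
    k = torch.cat([k_cache, k_new], dim=0)   # append this step's key
    v = torch.cat([v_cache, v_new], dim=0)   # append this step's value
    scores = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v, (k, v)                # updated cache stays off the GPU

# One decoding step: only the small q/k/v vectors cross the GPU-CPU link,
# never the KV-Cache itself, which is the source of the transmission savings.
d = 64
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
kv_cache = (torch.zeros(0, d), torch.zeros(0, d))  # empty cache at step 0
x = torch.randn(1, d)                              # hidden state of new token
q, k_new, v_new = s_part_project(x, wq, wk, wv)    # would run on the GPU
out, kv_cache = r_part_attend(q, k_new, v_new, kv_cache)  # would run near CPU
```

Because the per-token q/k/v vectors are orders of magnitude smaller than the cache they attend over, keeping the cache resident near the CPU lets the GPU batch many more sequences than its own memory would allow.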