This paper addresses the challenges of deploying long-context transformers, which are crucial for emerging AI applications such as video understanding and project-level coding. The primary goal is to make serving a 1-million-token context as cheap as serving a 4,000-token one. The paper identifies the size of the Key-Value (KV) cache as the main cost driver, since it directly affects four key performance metrics: concurrency, prefilling latency, decoding latency, and context-switching overhead.
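To make the cost driver concrete, here is a minimal back-of-the-envelope sketch of KV cache size as a function of context length. The model dimensions used as defaults (60 layers, 8 KV heads, head dimension 128, fp16 values) are illustrative assumptions for a 34B-class model, not figures taken from the paper.

```python
# Rough KV cache size for one request, as a function of context length.
# Default model dimensions below are illustrative assumptions, not paper figures.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 60,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV cache bytes for one request: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

for tokens in (4_000, 50_000, 1_000_000):
    gb = kv_cache_bytes(tokens) / 1e9
    print(f"{tokens:>9,} tokens -> ~{gb:6.1f} GB of KV cache")
```

Under these assumptions a 4K-token request needs roughly 1 GB of KV cache, while a 1M-token request needs hundreds of gigabytes, which is why the long-context regime stresses every stage of serving.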
The authors propose a concurrent programming framework to analyze these metrics under limited GPU High-Bandwidth Memory (HBM). Using a 34B GPT-3.5-level model with a 50K-token context as a running example, they identify four main deployment challenges (a rough numerical sketch follows the list):
1. **Prefilling**: Long inputs take much longer to prefill and consume more GPU memory compared to short inputs.
2. **Concurrency**: The large KV cache restricts the number of concurrent users that can be served.
3. **Decoding**: Repeatedly reading the KV cache from HBM into the streaming multiprocessors (SMs) increases per-token latency.
4. **Context Switching**: Swapping the KV cache between HBM and host DDR memory causes significant context-switching latency.
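The four challenges can be made quantitative with a peak-performance estimate in the spirit of the paper's analysis. Everything below is an assumption-laden sketch: the model shape, the 4x 80GB tensor-parallel setup, and the hardware numbers (peak FLOPS, HBM and PCIe bandwidth) are approximate A100-class figures chosen for illustration, with full utilization assumed.

```python
# Back-of-the-envelope estimates of the four metrics for one hypothetical
# deployment: a 34B-parameter fp16 model at 50K context, tensor-parallel
# across 4 x 80GB GPUs. All constants are illustrative assumptions.

PARAMS          = 34e9                      # model parameters
BYTES_PER_PARAM = 2                         # fp16/bf16 weights
KV_PER_TOKEN    = 2 * 60 * 8 * 128 * 2      # bytes/token: layers x KV heads x head_dim, fp16
CONTEXT         = 50_000                    # tokens per request

NUM_GPUS   = 4
HBM_BYTES  = NUM_GPUS * 80e9                # aggregate GPU memory
HBM_BW     = NUM_GPUS * 2.0e12              # aggregate HBM bandwidth, bytes/s
PEAK_FLOPS = NUM_GPUS * 312e12              # aggregate dense fp16 peak
PCIE_BW    = 25e9                           # effective host<->GPU bandwidth, bytes/s

weights_bytes = PARAMS * BYTES_PER_PARAM
kv_bytes      = KV_PER_TOKEN * CONTEXT

# 1. Concurrency: how many requests' KV caches fit next to the weights in HBM.
concurrency = (HBM_BYTES - weights_bytes) // kv_bytes

# 2. Prefilling: compute-bound, ~2 * params FLOPs per prompt token
#    (ignoring the quadratic attention term, assuming peak utilization).
prefill_s = 2 * PARAMS * CONTEXT / PEAK_FLOPS

# 3. Decoding: memory-bound; each step re-reads the weights and the KV cache.
per_token_decode_s = (weights_bytes + kv_bytes) / HBM_BW

# 4. Context switching: moving one request's KV cache over PCIe to/from host DDR.
context_switch_s = kv_bytes / PCIE_BW

print(f"concurrency          ~ {int(concurrency)} requests")
print(f"prefill latency      ~ {prefill_s:.1f} s")
print(f"decode latency/token ~ {per_token_decode_s * 1e3:.1f} ms")
print(f"context switch       ~ {context_switch_s:.2f} s per request")
```

Even with these generous peak-rate assumptions, a single 50K-token request costs seconds to prefill, tens of milliseconds per decoded token, and a sizable fraction of a second to swap in or out, which illustrates why all four metrics degrade together as context grows.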
The paper also discusses how hardware architecture, tensor parallelism, and the attention variant (e.g., Grouped-Query Attention (GQA) vs. Multi-Head Attention (MHA)) influence these metrics. It concludes by identifying research directions, such as lossless compression of the KV cache and integrating existing optimization techniques into an end-to-end serving system, to reduce the deployment cost of long-context transformers.
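As a final illustration of the GQA-vs-MHA point, the sketch below compares per-token KV cache size for a hypothetical 34B-class model; the head counts are assumptions, and the takeaway is the ratio: sharing KV heads shrinks the cache by roughly num_query_heads / num_kv_heads, which in turn raises concurrency and lowers decoding and context-switching cost.

```python
# KV cache per token under MHA vs. GQA for a hypothetical 34B-class model
# (60 layers, 56 query heads, head_dim 128, fp16). The exact numbers are
# assumptions; the MHA/GQA ratio is the point.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

mha = kv_bytes_per_token(60, 56, 128)   # MHA: every query head keeps its own K/V
gqa = kv_bytes_per_token(60, 8, 128)    # GQA: 8 KV heads shared across query heads

print(f"MHA: {mha / 1e6:.2f} MB/token, GQA: {gqa / 1e6:.2f} MB/token "
      f"-> GQA cache is ~{mha / gqa:.0f}x smaller per token")
```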