Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention


30 Jun 2024 | Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo
CachedAttention is a new attention mechanism designed to reduce the computational overhead of key-value (KV) cache recomputation in multi-turn conversations served by large language models (LLMs). The core problem is the high cost of repeatedly recomputing the KV caches of historical tokens at every conversation turn, which drives up serving cost. CachedAttention addresses this by enabling KV caches to be reused across turns of the same conversation, largely eliminating the recomputation.

CachedAttention maintains a hierarchical KV caching system that uses cost-effective memory and storage mediums to save the KV caches of all requests. To reduce the access overheads of these slower mediums, it employs layer-wise pre-loading and asynchronous saving schemes that overlap KV cache transfers with GPU computation. Scheduler-aware fetching and eviction schemes place KV caches in the fastest tier of the hierarchy based on hints from the inference job scheduler. To avoid invalidating saved KV caches when the context window overflows, CachedAttention decouples positional encoding from the KV caches, allowing the caches to be truncated directly. Extensive experimental results show that CachedAttention decreases the time to first token (TTFT) by up to 87%, improves prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces end-to-end inference cost by up to 70%.
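To give a flavor of how the layer-wise pre-loading and asynchronous saving described above might overlap KV cache movement with GPU computation, here is a minimal Python sketch. The `kv_store`, `layers`, and `session_id` objects and their methods (`load`, `save`, `forward`) are hypothetical placeholders standing in for the host-memory/disk cache and the model's transformer layers; they are not the paper's actual API.

```python
# Minimal sketch: overlap loading layer i+1's saved KV cache with computing layer i,
# and push newly produced KVs back to the store off the critical path.
from concurrent.futures import ThreadPoolExecutor

def prefill_with_preloading(layers, kv_store, session_id, hidden):
    """Prefill new prompt tokens while reusing the conversation's saved KV caches."""
    pool = ThreadPoolExecutor(max_workers=2)              # one slot for loads, one for saves
    pending = pool.submit(kv_store.load, session_id, 0)   # start fetching layer 0's KVs early
    for i, layer in enumerate(layers):
        past_kv = pending.result()                         # blocks only if the copy lags behind compute
        if i + 1 < len(layers):
            pending = pool.submit(kv_store.load, session_id, i + 1)  # prefetch next layer's KVs
        hidden, new_kv = layer.forward(hidden, past_kv)    # attention reuses cached historical KVs
        pool.submit(kv_store.save, session_id, i, new_kv)  # asynchronous write-back
    pool.shutdown(wait=True)                               # ensure all saves finish before returning
    return hidden
```

When the per-layer copy time is shorter than the per-layer compute time, the `result()` call returns immediately and the KV cache loads are effectively hidden behind GPU computation.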
The paper's key contributions are: investigating the KV cache recomputation overheads across conversation turns in LLM serving; proposing CachedAttention; designing the overlapped KV cache access, hierarchical KV cache placement, and positional-encoding-decoupled KV cache truncation schemes; and thoroughly evaluating CachedAttention on real datasets to demonstrate its efficacy and efficiency.
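The positional-encoding-decoupled truncation idea can likewise be sketched in a few lines: if keys are cached before rotary position embeddings are applied, the oldest tokens can be dropped when the context window overflows, and positions are re-applied only to the surviving tokens at attention time. The NumPy `apply_rope` helper and the tensor shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Apply rotary position embedding to a [tokens, dim] array."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)        # [tokens, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Cache keys WITHOUT positional encoding, so truncation keeps the remainder valid.
cached_keys = np.random.randn(8, 64)              # 8 historical tokens, head dim 64
truncated = cached_keys[2:]                       # drop the 2 oldest tokens on overflow
# Re-apply positions 0..5 to the surviving tokens only when attention is computed.
keys_with_pos = apply_rope(truncated, np.arange(truncated.shape[0]))
```

Because the stored keys carry no position information, dropping a prefix does not force the remaining cached entries to be recomputed; only the cheap position re-application is redone.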