30 Jun 2024 | Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo
This paper addresses the inefficiencies of large language models (LLMs) in multi-turn conversations by proposing CachedAttention, a new attention mechanism that enables the reuse of key-value (KV) caches across multiple turns. The main challenges include high KV cache access overheads, high storage capacity requirements, suitable placement of KV caches across different storage hierarchies, and unexpected invalidation of saved KV caches due to context window overflow. To mitigate these challenges, CachedAttention employs a hierarchical KV caching system spanning host memory and disks, and uses layer-wise pre-loading, asynchronous saving, scheduler-aware fetching, and decoupled positional encoding to optimize KV cache access and management. Experimental results on real datasets demonstrate that CachedAttention significantly reduces the time to first token (TTFT) by up to 87%, improves prompt prefilling throughput by up to 7.8×, and reduces end-to-end inference cost by up to 70%.
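To make the layered-caching idea more concrete, below is a minimal sketch of a two-tier (host memory + disk) KV cache store with layer-wise pre-loading and asynchronous saving. This is not the authors' implementation; all names (`HierarchicalKVCache`, `save_async`, `prefetch_layer`, the session and directory names) are illustrative assumptions, and placeholder lists stand in for KV tensors.

```python
# Sketch of a two-tier KV cache manager in the spirit of CachedAttention:
# host memory as the fast tier, disk as the capacity tier, with asynchronous
# saving and layer-wise pre-loading. All names here are assumptions.
import os
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class HierarchicalKVCache:
    def __init__(self, disk_dir="kv_cache"):
        self.disk_dir = disk_dir
        self.mem_tier = {}                 # session_id -> {layer: kv}
        self.lock = threading.Lock()
        self.io_pool = ThreadPoolExecutor(max_workers=2)
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, session_id, layer):
        return os.path.join(self.disk_dir, f"{session_id}_layer{layer}.pkl")

    def save_async(self, session_id, layer, kv):
        """Asynchronously persist one layer's KV cache so decoding is not blocked."""
        with self.lock:
            self.mem_tier.setdefault(session_id, {})[layer] = kv
        self.io_pool.submit(self._spill_to_disk, session_id, layer, kv)

    def _spill_to_disk(self, session_id, layer, kv):
        with open(self._disk_path(session_id, layer), "wb") as f:
            pickle.dump(kv, f)

    def prefetch_layer(self, session_id, layer):
        """Layer-wise pre-loading: fetch layer L's KV while layer L-1 computes."""
        with self.lock:
            cached = self.mem_tier.get(session_id, {}).get(layer)
        if cached is not None:
            return cached                  # hit in host memory
        path = self._disk_path(session_id, layer)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)      # hit on disk
        return None                        # miss: must recompute via prefill


# Usage sketch: overlap the load of layer i+1 with attention on layer i.
if __name__ == "__main__":
    cache = HierarchicalKVCache()
    num_layers = 4
    # Save KV produced in a previous conversation turn.
    for layer in range(num_layers):
        cache.save_async("session-42", layer, [[0.0] * 8])

    pending = cache.io_pool.submit(cache.prefetch_layer, "session-42", 0)
    for layer in range(num_layers):
        kv = pending.result()              # KV for the current layer
        if layer + 1 < num_layers:         # kick off the next layer's load early
            pending = cache.io_pool.submit(cache.prefetch_layer, "session-42", layer + 1)
        # ... run attention for `layer`, reusing `kv` instead of recomputing ...
```

The key point the sketch tries to capture is the overlap: saving happens off the critical path of decoding, and the fetch of the next layer's KV cache is issued before the current layer finishes, which is how layer-wise pre-loading hides slow-tier access latency.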