2024-07-01 | Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui
This paper introduces PQCache, a system-algorithm co-designed method for effective and efficient long context LLM inference. The key challenge addressed is the memory bottleneck caused by the Key-Value Cache (KVCache) in Large Language Models (LLMs), which becomes increasingly significant as context lengths grow. The authors propose PQCache, which uses Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, PQ is applied to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, important tokens are identified through Maximum Inner-Product Search (MIPS) using PQ codes and centroids, and the corresponding key-value pairs are fetched for self-attention computation. Through careful design of overlapping and caching, PQCache minimizes additional computation and communication overhead. Extensive experiments show that PQCache achieves both effectiveness and efficiency, maintaining model quality even when only 1/5 of the tokens are involved in attention, while attaining acceptable system latency. The paper also discusses related work, including selective attention methods, KVCache quantization, and embedding management, and concludes with the potential of integrating classic embedding management techniques into the LLM ecosystem.
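To make the prefill/decode split concrete, here is a minimal NumPy sketch of the general PQ-plus-MIPS idea the summary describes: quantize cached keys into sub-space codebooks at prefill time, then at each decoding step approximate the query-key inner products from the PQ codes alone and fetch only the top-scoring tokens' KV pairs. This is an illustrative sketch under my own assumptions, not the authors' implementation; the function names (`build_pq`, `approx_scores`) and all hyperparameters are hypothetical.

```python
# Illustrative sketch of PQ-based KVCache retrieval (not the PQCache codebase).
# Assumptions: single head, numpy only, toy k-means, hypothetical names.
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means used to fit one sub-space codebook."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign each sub-vector to its nearest centroid
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(k):
            pts = x[assign == c]
            if len(pts):
                centroids[c] = pts.mean(0)
    return centroids, assign

def build_pq(keys, m=4, k=16):
    """Prefill: split each key into m sub-vectors and quantize each part.
    Returns per-sub-space centroids (m, k, d/m) and PQ codes (n, m)."""
    n, d = keys.shape
    sub = d // m
    centroids, codes = [], []
    for i in range(m):
        part = keys[:, i * sub:(i + 1) * sub]
        c, a = kmeans(part, k, seed=i)
        centroids.append(c)
        codes.append(a)
    return np.stack(centroids), np.stack(codes, axis=1)

def approx_scores(query, centroids, codes):
    """Decode: approximate q·k for every cached token from PQ codes only,
    via a per-sub-space lookup table of query-centroid inner products."""
    m, k, sub = centroids.shape
    q_parts = query.reshape(m, sub)
    table = np.einsum('ms,mks->mk', q_parts, centroids)   # table[i, j] = <q_i, c_ij>
    # sum each token's sub-space contributions according to its codes
    return table[np.arange(m)[None, :], codes].sum(1)

# Toy usage: 1024 cached tokens, head dim 64, keep only the top 1/5 of tokens.
keys = np.random.randn(1024, 64).astype(np.float32)
values = np.random.randn(1024, 64).astype(np.float32)
centroids, codes = build_pq(keys)              # prefill-time quantization
q = np.random.randn(64).astype(np.float32)     # current decoding query
scores = approx_scores(q, centroids, codes)    # MIPS over PQ codes
top = np.argsort(-scores)[: len(keys) // 5]    # "important" token indices
k_sel, v_sel = keys[top], values[top]          # fetch only these KV pairs
```

The point of the sketch is the asymmetry the paper exploits: the full keys and values can live off-GPU, while the much smaller codes and centroids are enough to rank tokens per query; only the selected KV pairs need to be fetched for exact attention.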