1 Jul 2024 | Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui
PQCache is a product-quantization-based KVCache for long-context LLM inference. The paper introduces PQCache, a system-algorithm co-designed method that enables effective and efficient long-context LLM inference by applying product quantization (PQ) to KVCache management. PQCache reduces the memory and computation burden by using PQ codes and centroids to perform efficient maximum inner product search (MIPS) and retrieve the important tokens for the attention module. Through careful overlapping and caching, PQCache keeps this overhead negligible. Extensive experiments show that PQCache maintains model quality with only 1/5 of the tokens involved in attention while achieving acceptable system latency, outperforming existing methods in both effectiveness and efficiency. The paper also surveys related work, including selective attention for KVCache, KVCache quantization, KVCache scheduling, and embedding management.
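To make the core idea concrete, below is a minimal sketch of PQ-based approximate MIPS over cached keys, in the spirit of what the summary describes. It is not the paper's implementation: the dimensions, the number of sub-spaces, the random-sampling "codebook training," and the `approx_scores` helper are all illustrative assumptions; a real system would train codebooks with k-means (e.g., during prefill) and fuse the lookup into the attention kernel.

```python
import numpy as np

# Illustrative parameters (assumptions, not the paper's settings).
d = 128        # per-head key dimension
M = 8          # number of PQ sub-spaces (d must be divisible by M)
K = 256        # centroids per sub-space, so each code fits in one byte
sub = d // M   # dimension of each sub-vector

rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, d)).astype(np.float32)  # stand-in KVCache keys

# --- Offline: build PQ codebooks and encode the cached keys ---
# Random sampling of keys as centroids keeps this sketch dependency-free;
# real codebooks would come from k-means on the key sub-vectors.
codebooks = np.stack([
    keys[rng.choice(len(keys), K, replace=False), m * sub:(m + 1) * sub]
    for m in range(M)
])  # shape (M, K, sub)

codes = np.empty((len(keys), M), dtype=np.uint8)
for m in range(M):
    sub_keys = keys[:, m * sub:(m + 1) * sub]                        # (N, sub)
    dists = ((sub_keys[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
    codes[:, m] = dists.argmin(1)  # nearest centroid per sub-space

# --- Online: approximate q.k for every cached key from codes + centroids ---
def approx_scores(q):
    # Precompute inner products of each query sub-vector with all centroids,
    # then sum the table entries selected by each key's PQ codes.
    table = np.stack([codebooks[m] @ q[m * sub:(m + 1) * sub] for m in range(M)])
    return table[np.arange(M), codes].sum(1)  # (N,) approximate scores

q = rng.standard_normal(d).astype(np.float32)
scores = approx_scores(q)
top = np.argsort(scores)[-len(keys) // 5:]  # keep ~1/5 of tokens for attention
```

Only the selected `top` tokens would then have their exact key/value vectors fetched for the attention computation, which is how the approach keeps most of the KVCache out of the critical path.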