13 May 2024 | Haojie Duanmu*, Zhihang Yuan*, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
The paper "SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models" addresses the issue of memory consumption and performance bottlenecks in large language models (LLMs) due to the key-value (KV) cache required for processing long sequences. The authors propose a strategy called SKVQ (Sliding-window KV cache Quantization) to reduce the bitwidth of the KV cache while maintaining accuracy. SKVQ achieves this by rearranging the channels of the KV cache to improve similarity within quantization groups and applying clipped dynamic quantization at the group level. Additionally, it ensures that the most recent window tokens in the KV cache are preserved with high precision. The method is evaluated on various LLMs, demonstrating that it can quantize the key cache to 2 bits and the value cache to 1.5 bits with minimal accuracy loss. SKVQ enables processing context lengths of up to 1 million tokens on an 80GB memory GPU for a 7b model, achieving up to 7 times faster decoding. The paper also includes a detailed analysis of the effectiveness of different components of SKVQ and its impact on memory consumption and inference latency.The paper "SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models" addresses the issue of memory consumption and performance bottlenecks in large language models (LLMs) due to the key-value (KV) cache required for processing long sequences. The authors propose a strategy called SKVQ (Sliding-window KV cache Quantization) to reduce the bitwidth of the KV cache while maintaining accuracy. SKVQ achieves this by rearranging the channels of the KV cache to improve similarity within quantization groups and applying clipped dynamic quantization at the group level. Additionally, it ensures that the most recent window tokens in the KV cache are preserved with high precision. The method is evaluated on various LLMs, demonstrating that it can quantize the key cache to 2 bits and the value cache to 1.5 bits with minimal accuracy loss. SKVQ enables processing context lengths of up to 1 million tokens on an 80GB memory GPU for a 7b model, achieving up to 7 times faster decoding. The paper also includes a detailed analysis of the effectiveness of different components of SKVQ and its impact on memory consumption and inference latency.