SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models


13 May 2024 | Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
This paper presents SKVQ, a sliding-window key-value (KV) cache quantization strategy that targets extremely low-bitwidth KV cache quantization in large language models (LLMs). SKVQ improves quantization accuracy by rearranging the channels of the KV cache so that similar channels fall into the same quantization group, and by applying clipped dynamic quantization at the group level. It also keeps the most recent window of tokens in the KV cache at high precision, preserving the accuracy of a small but important portion of the cache. As a result, SKVQ achieves high compression ratios while maintaining accuracy: the KV cache can be quantized to 2-bit keys and 1.5-bit values with minimal accuracy loss, a 7B model can process context lengths of up to 1M tokens on an 80GB GPU, and decoding can be up to 7 times faster.

The key challenge in KV cache quantization is that the value distribution varies significantly across channels, which greatly impacts quantization accuracy, especially at extremely low bitwidths. To alleviate this problem, SKVQ applies clipped dynamic quantization with channel reorder. First, it uses a transformation-invariant permutation to group channels with similar statistical characteristics. Second, it applies clipped dynamic quantization within each group to further mitigate outliers. Together, these steps greatly reduce the quantization error within each group and improve the accuracy of the quantized model.

SKVQ also introduces a sliding-window quantization strategy that exempts a small portion of the most recently generated KV cache from quantization. As new tokens are generated, the probability of attending to older tokens' KV cache drops significantly, so the accuracy loss from quantizing them is small. The combined method, sliding-window KV cache quantization (SKVQ), is efficient and easy to integrate into existing inference systems, making it practical for real-world deployment.

Experiments on models from the LLaMA and Mistral families show that SKVQ can quantize the key cache to 2 bits and the value cache to 1.5 bits with almost no accuracy drop, and compared with previous quantization methods it achieves the best performance across different average bitwidths. The authors' performance analysis shows that SKVQ enables a 1M-token context for a 7B model on a single A100-80GB GPU, and that with batch size 128 and sequence length 200K the theoretical 7x decoding speedup can be realized. SKVQ outperforms previous quantization approaches on long-context tasks, demonstrating its effectiveness in reducing memory requirements and memory accesses, and it remains robust on tasks such as the needle-in-a-haystack retrieval test.
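To make the channel-reorder and clipped dynamic quantization step concrete, the sketch below shows one plausible reading of group-level asymmetric quantization over reordered channels. It is a minimal illustration, not the paper's kernel: the function name `reorder_and_quantize`, the clipping scheme (a fixed `clip_ratio` applied to each group's dynamic range), and the offline-computed permutation `perm` are assumptions for the example, and fractional bitwidths such as 1.5-bit values would additionally require bit packing that the sketch omits.

```python
import torch

def reorder_and_quantize(kv, perm, group_size=128, n_bits=2, clip_ratio=0.9):
    """Group-level asymmetric quantization of a KV-cache slice (illustrative sketch).

    kv         : [num_tokens, num_channels] key or value slice in full precision
    perm       : channel permutation that places statistically similar channels in
                 the same group (assumed computed offline from calibration stats)
    clip_ratio : fraction of each group's dynamic range kept after clipping;
                 1.0 disables clipping
    """
    x = kv[:, perm]                                   # channel reorder
    t, c = x.shape
    g = x.view(t, c // group_size, group_size)        # split channels into groups

    # per-group dynamic range, symmetrically clipped around its midpoint
    mn = g.amin(dim=-1, keepdim=True)
    mx = g.amax(dim=-1, keepdim=True)
    mid, half = (mx + mn) / 2, (mx - mn) / 2 * clip_ratio
    lo, hi = mid - half, mid + half

    qmax = 2 ** n_bits - 1
    scale = (hi - lo).clamp(min=1e-8) / qmax
    codes = ((g.clamp(lo, hi) - lo) / scale).round()  # integer codes in [0, qmax]

    # codes would be bit-packed in a real kernel; uint8 keeps the sketch simple
    dequant = (codes * scale + lo).view(t, c)         # what attention would consume
    return codes.to(torch.uint8), scale, lo, dequant
```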
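The sliding-window policy itself can be expressed as a simple cache-management rule: the most recent `window` tokens stay in full precision, while everything older is handed to the group quantizer. The sketch below is again an illustration under stated assumptions (it reuses the hypothetical `reorder_and_quantize` above and a flat `[tokens, channels]` layout), not the paper's implementation.

```python
import torch

def sliding_window_quantize(kv_fp, perm, window=128, group_size=128,
                            n_bits=2, clip_ratio=0.9):
    """Quantize all but the most recent `window` tokens of a KV-cache slice.

    kv_fp : [num_tokens, num_channels] full-precision key or value cache
    Returns (quantized_old, recent_fp): the older tokens as integer codes with
    per-group scale/offset, and the recent window left untouched.
    """
    num_tokens = kv_fp.shape[0]
    if num_tokens <= window:
        return None, kv_fp                            # nothing old enough to quantize yet

    old, recent = kv_fp[:-window], kv_fp[-window:]
    codes, scale, offset, _ = reorder_and_quantize(
        old, perm, group_size=group_size, n_bits=n_bits, clip_ratio=clip_ratio
    )
    return (codes, scale, offset), recent             # recent window stays high precision
```

In a real inference loop this would run incrementally: once a token slides out of the window, its key/value entries are quantized once and never revisited, so the full-precision region stays bounded at `window` tokens per sequence.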
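A rough back-of-the-envelope estimate (assuming standard LLaMA-7B dimensions: 32 layers, hidden size 4096, FP16 storage) helps sanity-check the 1M-token claim. The full-precision KV cache costs 2 x 32 x 4096 x 2 bytes, about 512 KB per token, or roughly 512 GB at 1M tokens, far beyond any single GPU. Quantizing keys to 2 bits and values to 1.5 bits (about 1.75 bits on average, ignoring scale/zero-point metadata and the full-precision window) shrinks this by roughly 9x to around 56 GB, which together with roughly 13-14 GB of FP16 weights fits within an 80GB A100.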