25 May 2024 | Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan
This paper introduces INTACTKV, a novel method to improve the quantization of large language models (LLMs) by preserving the KV cache of pivot tokens. Pivot tokens, the initial tokens of the input sequence, exhibit extreme activation values and are crucial to LLM performance. The proposed method generates a lossless KV cache for these pivot tokens with the full-precision model and concatenates it with the quantized model's KV cache, which effectively reduces quantization error and maintains the performance of the quantized LLM. INTACTKV can be integrated with existing quantization methods without additional inference overhead, and the intact KV cache can further be calibrated as extra trainable parameters to enhance the quantized model. Empirical results show that INTACTKV consistently improves various quantization methods across different LLMs and downstream tasks, achieving new state-of-the-art results for LLM quantization. The method is simple, easy to implement, and delivers clear gains in accuracy without sacrificing efficiency.
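To make the mechanism concrete, below is a minimal PyTorch-style sketch of how a lossless pivot-token KV cache could be generated by the full-precision model and prefixed to the quantized model's cache. The function and parameter names (`forward_with_cache`, `num_pivot_tokens`) are hypothetical placeholders for illustration, not the paper's reference implementation.

```python
import torch

def generate_with_intact_kv(fp_model, quant_model, input_ids, num_pivot_tokens=4):
    """Sketch of the INTACTKV idea under assumed model interfaces:
    both models expose `forward_with_cache`, returning (logits, kv_cache)."""
    # 1. Run the full-precision model on the leading pivot tokens only,
    #    producing a lossless KV cache for them.
    pivot_ids = input_ids[:, :num_pivot_tokens]
    with torch.no_grad():
        _, intact_kv = fp_model.forward_with_cache(pivot_ids)

    # 2. Run the quantized model on the remaining tokens, seeding its
    #    attention with the lossless pivot-token KV cache; the rest of
    #    the cache is produced by the quantized model as usual.
    rest_ids = input_ids[:, num_pivot_tokens:]
    with torch.no_grad():
        logits, kv_cache = quant_model.forward_with_cache(
            rest_ids, past_key_values=intact_kv
        )

    # Subsequent decoding steps reuse `kv_cache`, which now holds the intact
    # pivot-token entries followed by the quantized model's own entries.
    return logits, kv_cache
```

Because the pivot tokens are fixed prefix tokens, their intact KV cache can be computed once offline, so prefixing it adds no overhead at inference time.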