25 May 2024 | Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhou Xu, Lu Hou, Jun Yao, Chun Yuan
IntactKV improves large language model (LLM) quantization by preserving the KV cache of pivot tokens, which are critical for model performance. The paper identifies a previously overlooked type of outlier in LLMs: certain initial tokens, termed pivot tokens, on which attention scores concentrate. These outliers significantly affect quantization performance, as their values are highly sensitive to quantization error. To address this, the authors propose IntactKV, a method that generates the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and efficient, and it integrates with existing quantization solutions without additional inference overhead. IntactKV can also be calibrated as additional parameters to further improve quantized LLMs at minimal training cost. Mathematical analysis shows that IntactKV reduces the upper bound of the quantization error.

Empirical results demonstrate that IntactKV consistently improves various quantization methods across different LLMs and downstream tasks, achieving new state-of-the-art results for LLM quantization. The method targets weight-only quantization and extends to KV cache and activation quantization. Experiments show that IntactKV significantly improves the accuracy of quantized models, especially larger models with higher quantization errors, while remaining lightweight, with minimal training and inference overhead. IntactKV is effective across a range of LLMs, including LLaMA, LLaMA-2, and Vicuna, making it a promising approach for improving LLM quantization. The code is available at https://github.com/ruikangliu/IntactKV.
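The core idea — computing the cache entries of pivot tokens with full-precision weights while the rest of the model runs quantized — can be illustrated with a minimal toy sketch. This is not the paper's implementation: `fake_quant`, the scalar "key projection" `w_k`, and the token values are all hypothetical stand-ins chosen only to show where the lossless pivot entries enter the cache.

```python
def fake_quant(w, bits=3, max_abs=4.0):
    # Symmetric round-to-nearest quantization to `bits` bits (illustrative only).
    scale = max_abs / (2 ** (bits - 1) - 1)
    return round(w / scale) * scale

# Toy 1-D "key projection" weight and its (lossy) quantized counterpart.
w_k = 1.2345
w_k_q = fake_quant(w_k)

# Token activations; position 0 stands in for the pivot token (e.g. [BOS]).
xs = [0.5, -1.0, 2.0, 0.3]
n_pivot = 1

# IntactKV idea: cache entries for the pivot tokens are computed with the
# full-precision weight; the remaining tokens use the quantized weight.
k_cache = [w_k * x for x in xs[:n_pivot]] + [w_k_q * x for x in xs[n_pivot:]]

# The pivot entry is exactly the full-precision value; later entries
# carry quantization error.
assert k_cache[0] == w_k * xs[0]
assert k_cache[1] != w_k * xs[1]
```

In a real deployment this corresponds to prefilling the KV cache for the first few positions with the full-precision model once, then serving the quantized model with that cache prepended, which is why the method adds no per-token inference overhead.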