4 Jul 2024 | Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
KVQuant is a method for achieving ultra-low precision quantization of Key-Value (KV) cache activations in large language models (LLMs), enabling efficient long-context inference. The method incorporates several novel techniques: (i) per-channel key quantization, which adjusts the quantization dimension to better match the distribution of key activations; (ii) pre-RoPE key quantization, which quantizes key activations before applying rotary positional embedding to mitigate its impact on quantization; (iii) non-uniform KV cache quantization, which derives per-layer sensitivity-weighted non-uniform datatypes to better represent the distributions; and (iv) per-vector dense-and-sparse quantization, which isolates outliers separately for each vector to minimize skew in the quantization range. Applied to LLaMA, Llama-2, Llama-3, and Mistral models, KVQuant achieves less than 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4. The method enables serving LLaMA-7B with a context length of up to 1 million tokens on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system. Custom CUDA kernels are developed for KVQuant, achieving up to 1.7× speedups compared to baseline fp16 matrix-vector multiplications for LLaMA-7B. The code is available at https://github.com/SqueezeAILab/KVQuant.
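To make the per-channel vs. per-token distinction concrete, here is a minimal PyTorch sketch, not the paper's implementation (which uses non-uniform datatypes and fused CUDA kernels): keys get one scale per channel shared across tokens, while values get one scale per token. The uniform 3-bit quantizer, tensor shapes, and helper name are illustrative assumptions.

```python
# Illustrative sketch: per-channel key quantization vs. per-token value
# quantization, using plain uniform quantization for clarity. KVQuant itself
# uses sensitivity-weighted non-uniform datatypes and custom kernels.
import torch

def quantize_uniform(x: torch.Tensor, dim: int, n_bits: int = 3) -> torch.Tensor:
    """Uniformly quantize x with one scale/zero-point per slice along `dim`.

    Reducing over dim=0 (tokens) gives one scale per channel (per-channel);
    reducing over dim=1 (channels) gives one scale per token (per-token).
    """
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / (2**n_bits - 1)
    codes = torch.round((x - xmin) / scale)   # integer codes in [0, 2^n_bits - 1]
    return codes * scale + xmin               # dequantized approximation

# keys/values for a single attention head: [num_tokens, head_dim] (assumed shapes)
keys = torch.randn(512, 128)
values = torch.randn(512, 128)

keys_hat = quantize_uniform(keys, dim=0)      # per-channel: reduce over tokens
values_hat = quantize_uniform(values, dim=1)  # per-token: reduce over channels

print((keys - keys_hat).abs().mean(), (values - values_hat).abs().mean())
```

Because key activations have outlier channels whose magnitudes dwarf the rest, sharing quantization parameters along the channel dimension keeps those outliers from inflating the range used for every other channel.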
KV cache quantization is crucial for efficient long-context inference in LLMs, because the KV cache grows with context length and becomes the dominant memory bottleneck. Existing approaches fail to represent KV cache activations accurately at sub-4-bit precision. KVQuant addresses this with per-channel key quantization, pre-RoPE key quantization, non-uniform quantization, and per-vector dense-and-sparse quantization, which together enable accurate low-bit KV cache quantization with significant memory savings. The method is compatible with existing weight-only quantization methods and improves performance on long-context tasks. KVQuant achieves 4.8× KV cache compression with less than 0.1 perplexity degradation across different LLMs, and supports inference with LLaMA-7B at a context length of 10 million tokens on an 8-GPU system. Through its efficient kernel implementation, KVQuant also improves latency relative to the fp16 baseline, delivering speed gains in addition to memory savings.
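The dense-and-sparse idea can likewise be sketched in a few lines: for each vector, a small fraction of the largest-magnitude entries is pulled out and kept in full precision, so the remaining dense values are quantized over a much tighter range. The 1% outlier fraction, magnitude-based thresholding, and uniform quantizer below are illustrative assumptions rather than the paper's calibrated settings.

```python
# Illustrative sketch of per-vector dense-and-sparse quantization: outliers are
# isolated per vector and stored in full precision, and only the remaining
# dense values are quantized, so their range is not skewed by outliers.
import torch

def dense_and_sparse_quantize(x: torch.Tensor, n_bits: int = 3,
                              outlier_frac: float = 0.01) -> torch.Tensor:
    # x: [num_vectors, vector_dim]; outliers chosen per vector by magnitude
    k = max(1, int(outlier_frac * x.shape[1]))
    threshold = x.abs().topk(k, dim=1).values[:, -1:]        # per-vector cutoff
    outlier_mask = x.abs() >= threshold

    dense = torch.where(outlier_mask, torch.zeros_like(x), x)
    sparse = torch.where(outlier_mask, x, torch.zeros_like(x))  # kept in full precision

    # Quantize only the dense part, per vector.
    xmin = dense.amin(dim=1, keepdim=True)
    xmax = dense.amax(dim=1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / (2**n_bits - 1)
    dense_hat = torch.round((dense - xmin) / scale) * scale + xmin
    dense_hat = torch.where(outlier_mask, torch.zeros_like(dense_hat), dense_hat)

    return dense_hat + sparse   # reconstruction: dequantized dense + exact outliers

x = torch.randn(512, 128)
print((x - dense_and_sparse_quantize(x)).abs().mean())
```

In practice the sparse outliers would be stored in a compressed sparse format and recombined inside the attention kernels; this sketch only shows why removing a handful of per-vector outliers tightens the quantization range for everything else.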