KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

4 Jul 2024 | Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
KVQuant is a method for ultra-low-precision quantization of Key-Value (KV) cache activations in large language models (LLMs), enabling efficient long-context inference. It incorporates several novel techniques: (i) per-channel key quantization, which adjusts the dimension along which keys are quantized to better match the distribution of key activations; (ii) pre-RoPE key quantization, which quantizes key activations before the rotary positional embedding is applied, mitigating its impact on quantization; (iii) non-uniform KV cache quantization, which derives per-layer, sensitivity-weighted non-uniform datatypes that better represent the activation distributions; and (iv) per-vector dense-and-sparse quantization, which isolates outliers separately for each vector to minimize skew in the quantization ranges.

Applied to LLaMA, Llama-2, Llama-3, and Mistral models, KVQuant achieves less than 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4. It enables serving LLaMA-7B with a context length of up to 1 million tokens on a single A100-80GB GPU, and up to 10 million tokens on an 8-GPU system. Custom CUDA kernels achieve up to 1.7× speedups over baseline fp16 matrix-vector multiplications for LLaMA-7B. The code is available at https://github.com/SqueezeAILab/KVQuant.

KV cache quantization is crucial for efficient long-context inference because the KV cache grows with context length and becomes the dominant memory bottleneck, yet existing approaches fail to represent these activations accurately at sub-4-bit precision. By combining per-channel key quantization, pre-RoPE key quantization, non-uniform quantization, and per-vector dense-and-sparse quantization, KVQuant achieves 4.8× compression with less than 0.1 perplexity degradation across different LLMs. The method is compatible with existing weight-only quantization methods and improves performance on long-context tasks, and its efficient kernel implementation delivers latency improvements over the fp16 baseline in addition to the memory savings.
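To make the per-channel and dense-and-sparse ideas concrete, the sketch below quantizes one attention head's pre-RoPE keys in NumPy. It is a simplified illustration under stated assumptions, not the paper's implementation: it uses uniform integer codes rather than KVQuant's sensitivity-weighted non-uniform datatype, and the 1% outlier fraction, function names, and tensor shapes are illustrative choices.

```python
import numpy as np

def quantize_keys_per_channel(K, n_bits=3, outlier_frac=0.01):
    """Illustrative per-channel key quantization with per-vector outlier isolation.

    K: [seq_len, head_dim] pre-RoPE key activations for one attention head.
    Returns low-bit codes, per-channel scale/zero-point, and the isolated
    outliers kept in full precision (a dense-and-sparse decomposition).
    Simplified sketch: uniform codes, not the paper's non-uniform datatype.
    """
    K = K.astype(np.float32)
    seq_len, head_dim = K.shape

    # Keep the largest-magnitude entries in each channel in full precision so
    # they do not stretch the quantization range of the remaining values.
    k_out = max(1, int(outlier_frac * seq_len))
    outlier_mask = np.zeros_like(K, dtype=bool)
    for c in range(head_dim):
        top = np.argsort(np.abs(K[:, c]))[-k_out:]
        outlier_mask[top, c] = True

    # Per-channel (per-feature) range computed over the non-outlier values only.
    dense_vals = np.where(outlier_mask, np.nan, K)
    lo = np.nanmin(dense_vals, axis=0, keepdims=True)
    hi = np.nanmax(dense_vals, axis=0, keepdims=True)
    levels = 2 ** n_bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels

    codes = np.clip(np.round((K - lo) / scale), 0, levels).astype(np.uint8)
    outliers = K[outlier_mask]  # stored sparsely alongside their indices
    return codes, scale, lo, outlier_mask, outliers

def dequantize_keys(codes, scale, lo, outlier_mask, outliers):
    K_hat = codes.astype(np.float32) * scale + lo
    K_hat[outlier_mask] = outliers  # restore the full-precision outliers
    return K_hat

# Example: quantize one head's pre-RoPE keys to 3 bits and reconstruct.
K = np.random.randn(4096, 128).astype(np.float32)
packed = quantize_keys_per_channel(K, n_bits=3)
K_hat = dequantize_keys(*packed)
print("mean abs error:", np.abs(K - K_hat).mean())
```

Values (the other half of the KV cache) are quantized per token in the paper rather than per channel; the same outlier-isolation idea applies, with the reduction taken along the other axis.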