4 Jul 2024 | Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
KVQuant is a method for achieving ultra-low precision quantization of Key-Value (KV) cache activations in large language models (LLMs), enabling efficient long-context inference. The method incorporates several novel techniques: (i) per-channel key quantization, which adjusts the quantization dimension to better match the distribution of key activations; (ii) pre-RoPE key quantization, which quantizes key activations before applying rotary positional embedding to mitigate its impact on quantization; (iii) non-uniform KV cache quantization, which derives per-layer sensitivity-weighted non-uniform datatypes to better represent the distributions; and (iv) per-vector dense-and-sparse quantization, which isolates outliers separately for each vector to minimize skew in the quantization range. Applied to LLaMA, Llama-2, Llama-3, and Mistral models, KVQuant achieves less than 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4. The method enables serving LLaMA-7B with a context length of up to 1 million tokens on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system. Custom CUDA kernels are developed for KVQuant, achieving up to 1.7× speedups compared to baseline fp16 matrix-vector multiplications for LLaMA-7B. The code is available at https://github.com/SqueezeAILab/KVQuant.
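To make the per-channel vs. per-token distinction concrete, here is a minimal PyTorch sketch, not the paper's implementation (which uses non-uniform datatypes and fused CUDA kernels): keys get one scale per channel shared across tokens, while values get one scale per token. The uniform 3-bit quantizer, tensor shapes, and helper name are illustrative assumptions.

```python
# Illustrative sketch: per-channel key quantization vs. per-token value
# quantization, using plain uniform quantization for clarity. KVQuant itself
# uses sensitivity-weighted non-uniform datatypes and custom kernels.
import torch

def quantize_uniform(x: torch.Tensor, dim: int, n_bits: int = 3) -> torch.Tensor:
    """Uniformly quantize x with one scale/zero-point per slice along `dim`.

    Reducing over dim=0 (tokens) gives one scale per channel (per-channel);
    reducing over dim=1 (channels) gives one scale per token (per-token).
    """
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / (2**n_bits - 1)
    codes = torch.round((x - xmin) / scale)   # integer codes in [0, 2^n_bits - 1]
    return codes * scale + xmin               # dequantized approximation

# keys/values for a single attention head: [num_tokens, head_dim] (assumed shapes)
keys = torch.randn(512, 128)
values = torch.randn(512, 128)

keys_hat = quantize_uniform(keys, dim=0)      # per-channel: reduce over tokens
values_hat = quantize_uniform(values, dim=1)  # per-token: reduce over channels

print((keys - keys_hat).abs().mean(), (values - values_hat).abs().mean())
```

Because key activations have outlier channels whose magnitudes dwarf the rest, sharing quantization parameters along the channel dimension keeps those outliers from inflating the range used for every other channel.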
KV cache quantization is crucial for efficient long-context inference in LLMs, because the KV cache grows with context length and becomes the dominant memory bottleneck. Existing approaches fail to represent KV cache activations accurately at sub-4-bit precision. KVQuant addresses this with per-channel key quantization, pre-RoPE key quantization, non-uniform quantization, and per-vector dense-and-sparse quantization, which together enable accurate low-bit KV cache quantization with significant memory savings. The method is compatible with existing weight-only quantization methods and improves performance on long-context tasks. KVQuant achieves 4.8× KV cache compression with less than 0.1 perplexity degradation across different LLMs, and supports inference with LLaMA-7B at a context length of 10 million tokens on an 8-GPU system. Through its efficient kernel implementation, KVQuant also improves latency relative to the fp16 baseline, delivering speed gains in addition to memory savings.
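The dense-and-sparse idea can likewise be sketched in a few lines: for each vector, a small fraction of the largest-magnitude entries is pulled out and kept in full precision, so the remaining dense values are quantized over a much tighter range. The 1% outlier fraction, magnitude-based thresholding, and uniform quantizer below are illustrative assumptions rather than the paper's calibrated settings.

```python
# Illustrative sketch of per-vector dense-and-sparse quantization: outliers are
# isolated per vector and stored in full precision, and only the remaining
# dense values are quantized, so their range is not skewed by outliers.
import torch

def dense_and_sparse_quantize(x: torch.Tensor, n_bits: int = 3,
                              outlier_frac: float = 0.01) -> torch.Tensor:
    # x: [num_vectors, vector_dim]; outliers chosen per vector by magnitude
    k = max(1, int(outlier_frac * x.shape[1]))
    threshold = x.abs().topk(k, dim=1).values[:, -1:]        # per-vector cutoff
    outlier_mask = x.abs() >= threshold

    dense = torch.where(outlier_mask, torch.zeros_like(x), x)
    sparse = torch.where(outlier_mask, x, torch.zeros_like(x))  # kept in full precision

    # Quantize only the dense part, per vector.
    xmin = dense.amin(dim=1, keepdim=True)
    xmax = dense.amax(dim=1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / (2**n_bits - 1)
    dense_hat = torch.round((dense - xmin) / scale) * scale + xmin
    dense_hat = torch.where(outlier_mask, torch.zeros_like(dense_hat), dense_hat)

    return dense_hat + sparse   # reconstruction: dequantized dense + exact outliers

x = torch.randn(512, 128)
print((x - dense_and_sparse_quantize(x)).abs().mean())
```

In practice the sparse outliers would be stored in a compressed sparse format and recombined inside the attention kernels; this sketch only shows why removing a handful of per-vector outliers tightens the quantization range for everything else.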