KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache


2024 | Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
KIVI is a tuning-free, 2-bit asymmetric quantization method for the key-value (KV) cache of large language models (LLMs). It shrinks the KV cache's memory footprint by quantizing the key cache per-channel and the value cache per-token, which keeps the impact on model accuracy small while substantially reducing peak memory usage. With KIVI, Llama, Falcon, and Mistral models maintain almost the same quality while using 2.6× less peak memory, enabling up to 4× larger batch sizes and improving throughput by 2.35× to 3.47× on real LLM inference workloads. The algorithm is implemented in a hardware-friendly way, so it executes efficiently on GPUs.

The KV cache is a critical component of LLM inference: it stores attention keys and values so they do not have to be recomputed for every generated token. With larger batch sizes and longer context lengths, however, the KV cache becomes a major memory and speed bottleneck. Existing ways to shrink it include reducing the number of attention heads, evicting unimportant tokens, and applying system-level optimizations, but these approaches often require training or fine-tuning, which is not feasible in many deployments.

KIVI's design follows from an analysis of the element distribution in the KV cache: the key cache has a few channels with very large magnitudes, while the value cache shows no clear outlier pattern. The conclusion is that the key cache should be quantized per-channel and the value cache per-token; quantizing keys along the channel dimension confines the outlier channels to their own quantization groups, so they no longer inflate the quantization range of other elements, which minimizes quantization error and maintains model accuracy.
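The asymmetric layout can be illustrated with a short sketch. The snippet below is a minimal, unoptimized illustration rather than the paper's actual kernels: the helper names are hypothetical, the 2-bit codes are stored unpacked in uint8 instead of being bit-packed, and group sizes are ignored. The only point it makes is that the key and value caches are reduced along different axes.

```python
import torch

def asymmetric_quantize(x: torch.Tensor, n_bits: int = 2, dim: int = -1):
    """Asymmetric (zero-point) quantization of x along `dim`.

    Returns integer codes plus the scale and zero-point needed to
    dequantize: x_hat = codes * scale + zero_point.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = x_min
    codes = ((x - zero_point) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, zero_point

# Toy caches shaped (batch, heads, tokens, head_dim).
K = torch.randn(1, 8, 128, 64)
V = torch.randn(1, 8, 128, 64)

# Key cache: a few channels carry large magnitudes, so quantize
# per-channel, i.e. compute min/max across the token axis.
K_codes, k_scale, k_zero = asymmetric_quantize(K, n_bits=2, dim=-2)

# Value cache: no clear outlier pattern, so quantize per-token,
# i.e. compute min/max across the channel (head_dim) axis.
V_codes, v_scale, v_zero = asymmetric_quantize(V, n_bits=2, dim=-1)
```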
The per-token scheme also matches the streaming nature of auto-regressive inference: a newly quantized value tensor can be appended directly to the existing quantized value cache along the token dimension. Per-channel quantization of the key cache, by contrast, spans different tokens, so new keys cannot simply be appended; KIVI therefore groups incoming keys along the token dimension and quantizes each full group separately.

Concretely, KIVI splits both the key and value caches into a grouped part and a residual part. The grouped part is quantized to 2 bits, while the residual part, the most recent tokens that have not yet filled a group, is kept in full precision. During decoding, attention scores are computed with a tiled matrix multiplication: the query is multiplied against the quantized grouped part and the full-precision residual part separately, and the two tiles are then combined, which preserves the model's accuracy (a simplified sketch of this step appears at the end of this summary).

Experiments show that KIVI delivers significant improvements in memory usage and throughput while maintaining model accuracy. Evaluated on Llama, Falcon, and Mistral models, it reduces peak memory usage and enables larger batch sizes, indicating that 2-bit KV cache quantization is a promising way to improve the efficiency of LLM inference without compromising model performance.
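To make the grouped/residual decoding step described above concrete, here is a minimal sketch of one decoding step, assuming the grouped codes are kept unpacked as uint8 and dequantized on the fly; KIVI's actual GPU kernels instead fuse dequantization into the matrix multiplication, and all names here are illustrative rather than taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def decode_attention(q, K_codes, k_scale, k_zero, K_res,
                     V_codes, v_scale, v_zero, V_res):
    """One decoding step over a grouped/residual KV cache.

    q                : (batch, heads, 1, head_dim) query of the new token
    K_codes, V_codes : 2-bit codes of the grouped keys/values (uint8 here)
    k_scale, k_zero  : per-channel scale/zero-point of the grouped keys
    v_scale, v_zero  : per-token scale/zero-point of the grouped values
    K_res, V_res     : full-precision residual keys/values (recent tokens)
    """
    # Dequantize the grouped tiles (a fused kernel would avoid this).
    K_grp = K_codes.to(q.dtype) * k_scale + k_zero
    V_grp = V_codes.to(q.dtype) * v_scale + v_zero

    # Tiled score computation: grouped and residual keys are handled
    # separately, then concatenated along the token axis.
    d = q.shape[-1]
    scores = torch.cat(
        [q @ K_grp.transpose(-1, -2), q @ K_res.transpose(-1, -2)], dim=-1
    ) / d ** 0.5
    probs = F.softmax(scores, dim=-1)

    # Combine the grouped and residual value tiles into the output.
    n_grp = K_grp.shape[-2]
    return probs[..., :n_grp] @ V_grp + probs[..., n_grp:] @ V_res
```

Note the broadcasting in the dequantization step: `k_scale` has shape (..., 1, head_dim), so it scales whole channels of the key cache, while `v_scale` has shape (..., tokens, 1), so it scales whole tokens of the value cache, mirroring the per-channel/per-token split.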