KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache


2024 | Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
KIVI is a tuning-free, 2-bit asymmetric quantization method for the key-value (KV) cache of large language models (LLMs). It shrinks the KV cache's memory footprint by quantizing the key cache per-channel and the value cache per-token, which keeps the impact on model accuracy small while substantially reducing peak memory usage. With KIVI, Llama, Falcon, and Mistral models maintain almost the same quality while using 2.6× less peak memory, enabling up to 4× larger batch sizes and improving throughput by 2.35× to 3.47× on real LLM inference workloads. The algorithm is implemented in a hardware-friendly way, so it executes efficiently on GPUs.

The KV cache is a critical component of LLM inference: it stores attention keys and values so they do not have to be recomputed for every generated token. With larger batch sizes and longer context lengths, however, the KV cache becomes a major memory and speed bottleneck. Existing ways to shrink it include reducing the number of attention heads, evicting unimportant tokens, and applying system-level optimizations, but these approaches often require training or fine-tuning, which is not feasible in many deployments.

KIVI's design follows from an analysis of the element distribution in the KV cache: the key cache has a few channels with very large magnitudes, while the value cache shows no clear outlier pattern. The conclusion is that the key cache should be quantized per-channel and the value cache per-token; quantizing keys along the channel dimension confines the outlier channels to their own quantization groups, so they no longer inflate the quantization range of other elements, which minimizes quantization error and maintains model accuracy.
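The asymmetric layout can be illustrated with a short sketch. The snippet below is a minimal, unoptimized illustration rather than the paper's actual kernels: the helper names are hypothetical, the 2-bit codes are stored unpacked in uint8 instead of being bit-packed, and group sizes are ignored. The only point it makes is that the key and value caches are reduced along different axes.

```python
import torch

def asymmetric_quantize(x: torch.Tensor, n_bits: int = 2, dim: int = -1):
    """Asymmetric (zero-point) quantization of x along `dim`.

    Returns integer codes plus the scale and zero-point needed to
    dequantize: x_hat = codes * scale + zero_point.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = x_min
    codes = ((x - zero_point) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, zero_point

# Toy caches shaped (batch, heads, tokens, head_dim).
K = torch.randn(1, 8, 128, 64)
V = torch.randn(1, 8, 128, 64)

# Key cache: a few channels carry large magnitudes, so quantize
# per-channel, i.e. compute min/max across the token axis.
K_codes, k_scale, k_zero = asymmetric_quantize(K, n_bits=2, dim=-2)

# Value cache: no clear outlier pattern, so quantize per-token,
# i.e. compute min/max across the channel (head_dim) axis.
V_codes, v_scale, v_zero = asymmetric_quantize(V, n_bits=2, dim=-1)
```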
The per-token scheme also matches the streaming nature of auto-regressive inference: a newly quantized value tensor can be appended directly to the existing quantized value cache along the token dimension. Per-channel quantization of the key cache, by contrast, spans different tokens, so new keys cannot simply be appended; KIVI therefore groups incoming keys along the token dimension and quantizes each full group separately.

Concretely, KIVI splits both the key and value caches into a grouped part and a residual part. The grouped part is quantized to 2 bits, while the residual part, the most recent tokens that have not yet filled a group, is kept in full precision. During decoding, attention scores are computed with a tiled matrix multiplication: the query is multiplied against the quantized grouped part and the full-precision residual part separately, and the two tiles are then combined, which preserves the model's accuracy (a simplified sketch of this step appears at the end of this summary).

Experiments show that KIVI delivers significant improvements in memory usage and throughput while maintaining model accuracy. Evaluated on Llama, Falcon, and Mistral models, it reduces peak memory usage and enables larger batch sizes, indicating that 2-bit KV cache quantization is a promising way to improve the efficiency of LLM inference without compromising model performance.
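To make the grouped/residual decoding step described above concrete, here is a minimal sketch of one decoding step, assuming the grouped codes are kept unpacked as uint8 and dequantized on the fly; KIVI's actual GPU kernels instead fuse dequantization into the matrix multiplication, and all names here are illustrative rather than taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def decode_attention(q, K_codes, k_scale, k_zero, K_res,
                     V_codes, v_scale, v_zero, V_res):
    """One decoding step over a grouped/residual KV cache.

    q                : (batch, heads, 1, head_dim) query of the new token
    K_codes, V_codes : 2-bit codes of the grouped keys/values (uint8 here)
    k_scale, k_zero  : per-channel scale/zero-point of the grouped keys
    v_scale, v_zero  : per-token scale/zero-point of the grouped values
    K_res, V_res     : full-precision residual keys/values (recent tokens)
    """
    # Dequantize the grouped tiles (a fused kernel would avoid this).
    K_grp = K_codes.to(q.dtype) * k_scale + k_zero
    V_grp = V_codes.to(q.dtype) * v_scale + v_zero

    # Tiled score computation: grouped and residual keys are handled
    # separately, then concatenated along the token axis.
    d = q.shape[-1]
    scores = torch.cat(
        [q @ K_grp.transpose(-1, -2), q @ K_res.transpose(-1, -2)], dim=-1
    ) / d ** 0.5
    probs = F.softmax(scores, dim=-1)

    # Combine the grouped and residual value tiles into the output.
    n_grp = K_grp.shape[-2]
    return probs[..., :n_grp] @ V_grp + probs[..., n_grp:] @ V_res
```

Note the broadcasting in the dequantization step: `k_scale` has shape (..., 1, head_dim), so it scales whole channels of the key cache, while `v_scale` has shape (..., tokens, 1), so it scales whole tokens of the value cache, mirroring the per-channel/per-token split.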