23 May 2024 | Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang
ZipCache is an efficient and accurate method for compressing the KV cache in large language models (LLMs). It addresses the significant storage demands of the KV cache, especially for long sequences, by adaptively compressing less important tokens while preserving vital information. The key contributions of ZipCache include:
1. **Channel-Separable Tokenwise Quantization**: By decoupling the channel and token dimensions, this scheme stores far fewer quantization parameters than fine-grained groupwise quantization, significantly improving compression efficiency (see the first sketch after this list).
2. **Normalized Attention Score for Salient Token Identification**: A novel metric based on normalized attention scores accurately identifies salient tokens, which removes the positional bias of raw accumulated attention; salient tokens can then be kept at higher precision while less important tokens are adaptively quantized to a lower bit-width (see the second sketch below).
3. **Efficient Approximation of the Saliency Metric**: Because fused kernels such as FlashAttention never materialize the full attention matrix, the exact saliency metric cannot be read off cheaply; an efficient approximation is therefore developed so the metric integrates with fast attention implementations, improving generation speed and reducing memory usage (see the third sketch below).
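To make contribution 1 concrete, here is a minimal NumPy sketch of channel-separable tokenwise quantization: per-channel scales first disentangle channel magnitudes, then each token gets its own quantization parameters. The function name, the 4-bit setting, and the mean-magnitude channel scale are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def channel_separable_tokenwise_quant(kv, n_bits=4):
    """Quantize a KV tensor of shape (num_tokens, num_channels).

    Step 1: normalize each channel by its magnitude so that outlier
    channels do not dominate the per-token quantization range.
    Step 2: quantize token-wise (one scale/zero-point per token),
    which stores far fewer parameters than fine-grained groupwise
    quantization.
    """
    # Per-channel normalization factors, shape (num_channels,)
    channel_scale = np.abs(kv).mean(axis=0) + 1e-8
    normalized = kv / channel_scale

    # Per-token asymmetric quantization parameters, shape (num_tokens, 1)
    t_min = normalized.min(axis=1, keepdims=True)
    t_max = normalized.max(axis=1, keepdims=True)
    qmax = 2 ** n_bits - 1
    scale = np.maximum((t_max - t_min) / qmax, 1e-8)
    q = np.clip(np.round((normalized - t_min) / scale), 0, qmax)

    # Dequantize for verification; in practice q is stored in low-bit form
    deq = (q * scale + t_min) * channel_scale
    return q.astype(np.uint8), scale, t_min, channel_scale, deq

# Example: 8 cached tokens with 16 channels
kv = np.random.randn(8, 16).astype(np.float32)
q, scale, zero, ch_scale, deq = channel_separable_tokenwise_quant(kv)
print("max reconstruction error:", np.abs(kv - deq).max())
```

Note the parameter count: one scale and zero-point per token plus one shared vector per channel, versus one pair per group under groupwise quantization.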
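For contribution 2, the intuition is that under causal masking, token j appears in the attention rows of all later queries, so simply summing a column of the attention matrix favors early tokens. A sketch of the normalized score, assuming a single head and dividing each column sum by the number of queries that can attend to that token:

```python
import numpy as np

def normalized_attention_scores(attn):
    """Saliency metric from a causal attention matrix of shape (n, n).

    Column sums of a lower-triangular attention matrix are biased
    toward early tokens, since token j is attended to by n - j
    queries. Dividing by that count removes the positional bias.
    """
    n = attn.shape[0]
    accumulated = attn.sum(axis=0)      # column sums, shape (n,)
    times_attended = n - np.arange(n)   # token j has n - j attending queries
    return accumulated / times_attended

# Toy causal attention matrix via masked softmax
n = 6
logits = np.random.randn(n, n)
logits[~np.tril(np.ones((n, n), dtype=bool))] = -np.inf
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

scores = normalized_attention_scores(attn)
print("saliency ranking:", np.argsort(scores)[::-1])  # most salient first
```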
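For contribution 3, the paper's approximation is more involved than what fits here; the sketch below only illustrates the general idea, assuming PyTorch: estimate the normalized score from a small subset of probe queries computed explicitly, while the attention output itself goes through a fused kernel that never materializes the attention matrix. The function name, `probe_ratio`, and the recent-queries selection strategy are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def probe_saliency(q, k, probe_ratio=0.1):
    """Approximate normalized attention scores from a few probe queries.

    Only the probe rows of the attention matrix are materialized; the
    rest of the computation can use a fused kernel (e.g. FlashAttention).
    """
    n, d = q.shape
    n_probe = max(1, int(n * probe_ratio))
    probe_idx = torch.arange(n - n_probe, n)  # probe the most recent queries
    logits = q[probe_idx] @ k.T / d ** 0.5
    # Causal mask: probe query at position i attends only to keys <= i
    mask = torch.arange(n) > probe_idx[:, None]
    attn = logits.masked_fill(mask, float("-inf")).softmax(dim=-1)
    # Normalize each key's accumulated score by how many probes saw it
    times_attended = (~mask).sum(dim=0).clamp(min=1)
    return attn.sum(dim=0) / times_attended

n, d = 128, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
scores = probe_saliency(q, k)            # cheap saliency estimate
out = F.scaled_dot_product_attention(    # fused path for the actual output
    q[None], k[None], v[None], is_causal=True
)
print(scores.shape, out.shape)
```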
Experiments on various benchmarks, including GSM8k, Line Retrieval, and HumanEval, demonstrate that ZipCache achieves superior compression ratios, fast generation speed, and minimal performance losses compared to previous methods. For example, on the GSM8k dataset, ZipCache compresses the KV cache by 4.98× with only a 0.38% drop in accuracy. Additionally, ZipCache reduces prefill-phase latency by 37.3%, decoding-phase latency by 56.9%, and GPU memory usage by 19.8% when evaluated on the LLaMA3-8B model with an input length of 4096.