ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

23 May 2024 | Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang
ZipCache is an accurate and efficient KV cache quantization method for large language models (LLMs). It tackles the high memory usage and computational overhead of the KV cache by identifying salient tokens and applying adaptive quantization: salient tokens are stored at higher precision, while less important tokens are compressed more aggressively.

The method introduces a channel-separable tokenwise quantization scheme that reduces the memory overhead of quantization parameters compared with traditional groupwise quantization. To decide which tokens matter, it proposes the normalized attention score as a saliency metric, which preserves important information while allowing less important tokens to be compressed. Finally, an efficient approximation method decouples the saliency metric from the full attention scores, making ZipCache compatible with fast attention implementations such as FlashAttention. The sketches below illustrate each of these three ideas in turn.
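To make the channel-separable tokenwise scheme concrete, here is a minimal PyTorch sketch. The function names, the max-based channel normalization, and the 4-bit default are illustrative assumptions rather than the paper's exact implementation; the point is that one shared scale per channel absorbs channel-wise outliers, after which each token needs only a single scale and zero-point, instead of one pair per group as in groupwise quantization.

```python
import torch

def channel_separable_tokenwise_quant(x: torch.Tensor, n_bits: int = 4):
    # Hypothetical sketch. x: KV cache slice of shape (num_tokens, num_channels).
    # Step 1: absorb channel-wise outliers with one scale per channel,
    # shared across all tokens (num_channels extra parameters in total).
    channel_scale = x.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    x_norm = x / channel_scale

    # Step 2: asymmetric per-token quantization of the normalized rows
    # (one scale and one zero-point per token, not per group).
    qmax = 2 ** n_bits - 1
    row_min = x_norm.amin(dim=1, keepdim=True)
    row_max = x_norm.amax(dim=1, keepdim=True)
    scale = (row_max - row_min).clamp(min=1e-8) / qmax
    zero_point = (-row_min / scale).round()
    q = ((x_norm / scale) + zero_point).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero_point, channel_scale

def dequantize(q, scale, zero_point, channel_scale):
    # Invert both steps: per-token dequantization, then channel rescaling.
    return (q.float() - zero_point) * scale * channel_scale
```

Under this layout, the quantization-parameter overhead grows roughly with num_tokens + num_channels, whereas groupwise quantization stores a scale and zero-point for every group of every token.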
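The saliency metric can be sketched just as briefly. Under a causal mask, a raw accumulated attention score is biased toward early tokens, since every later query contributes probability mass to them; normalizing by the number of queries that actually attend to each token removes this positional bias. The helper below is a hedged sketch of that idea, not the paper's code:

```python
import torch

def normalized_attention_saliency(attn: torch.Tensor) -> torch.Tensor:
    # attn: causal attention-probability matrix of shape (seq_len, seq_len),
    # lower-triangular after softmax (masked entries are exactly zero).
    accumulated = attn.sum(dim=0)                        # per-token attention mass
    num_attending = (attn != 0).sum(dim=0).clamp(min=1)  # queries attending to each token
    return accumulated / num_attending
```

Tokens with the highest normalized scores are treated as salient and kept at higher precision; the rest are quantized to lower bit-widths.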
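Finally, the decoupling from full attention scores can be approximated by computing explicit attention rows for only a few probe queries, while a fused kernel (e.g., FlashAttention via PyTorch's scaled_dot_product_attention) produces the actual output without ever materializing the full score matrix. The sketch below is an assumption-laden illustration: num_probes, the evenly spaced probe selection, and the function name are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def attention_with_approx_saliency(q, k, v, num_probes: int = 32):
    # q, k, v: (batch, heads, seq_len, head_dim).
    seq_len, head_dim = q.shape[-2], q.shape[-1]

    # Fast path: fused/flash attention for the layer output; the full
    # (seq_len x seq_len) score matrix is never materialized.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Saliency path: explicit attention rows for a few probe queries only.
    probe_idx = torch.linspace(0, seq_len - 1, num_probes, device=q.device).long()
    scores = (q[..., probe_idx, :] @ k.transpose(-2, -1)) / head_dim ** 0.5
    invalid = probe_idx[:, None] < torch.arange(seq_len, device=q.device)  # keys after each probe
    probe_attn = scores.masked_fill(invalid, float("-inf")).softmax(dim=-1)

    # Normalized saliency (see previous sketch), estimated from the probes.
    num_attending = (~invalid).sum(dim=0).clamp(min=1)
    saliency = probe_attn.sum(dim=-2) / num_attending
    return out, saliency
```

The cost of the saliency estimate scales with num_probes rather than seq_len, which is what allows the metric to coexist with memory-efficient attention kernels.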
Extensive experiments show that ZipCache achieves superior compression ratios, fast generation speed, and minimal performance loss compared with previous KV cache compression methods. For example, on the GSM8k dataset, ZipCache compresses the Mistral-7B model's KV cache by 4.98× with only a 0.38% drop in accuracy. In terms of efficiency, it also delivers a 37.3% reduction in prefill-phase latency, a 56.9% reduction in decoding-phase latency, and a 19.8% reduction in GPU memory usage when evaluating LLaMA3-8B with an input length of 4096.

In short, the contributions are threefold: an efficient channel-separable quantization scheme, an accurate token-saliency metric based on normalized attention scores, and an efficient approximation of that metric which integrates seamlessly with fast attention implementations. Together, these techniques make ZipCache an accurate and efficient framework for KV cache compression.