A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression


17 Jun 2024 | Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
The paper presents a simple and effective strategy for compressing the Key-Value (KV) cache in large language models (LLMs) to reduce memory requirements, especially as context lengths increase. The authors observe that the $L_2$ norm of key embeddings is highly correlated with attention scores over cached KV pairs: key embeddings with a low $L_2$ norm tend to receive higher attention scores during decoding. Based on this observation, the proposed method retains only the keys with the lowest $L_2$ norms and their corresponding values, significantly reducing the KV cache size without compromising accuracy. Experimental results show that this approach can reduce the KV cache size by 50% on language modeling tasks and by 90% on passkey retrieval tasks while maintaining performance in both scenarios. The method is straightforward and can be applied to any transformer-based, decoder-only LLM without additional training or significant modifications. The paper also discusses limitations and future work, including the need for further theoretical exploration to understand the underlying reasons for the correlation between $L_2$ norm and attention scores.
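
To make the selection rule concrete, below is a minimal sketch of $L_2$ norm-based KV cache pruning in PyTorch. The function name `compress_kv_cache`, the tensor layout, and the `keep_ratio` parameter are illustrative assumptions for this sketch, not taken from the authors' implementation.

```python
# Minimal sketch of L2 norm-based KV cache compression, assuming PyTorch
# tensors shaped [batch, num_heads, seq_len, head_dim]. Names and the
# keep_ratio parameter are hypothetical, not from the paper's codebase.
import torch


def compress_kv_cache(keys: torch.Tensor,
                      values: torch.Tensor,
                      keep_ratio: float = 0.5):
    """Keep only the cached KV pairs whose key embeddings have the lowest L2 norm.

    keys, values: [batch, num_heads, seq_len, head_dim]
    keep_ratio:   fraction of the cache to retain (e.g. 0.5 keeps 50%).
    """
    batch, num_heads, seq_len, head_dim = keys.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # L2 norm of each key embedding: [batch, num_heads, seq_len]
    key_norms = keys.norm(p=2, dim=-1)

    # Indices of the keys with the *lowest* norms; per the paper's
    # observation, these tend to receive the highest attention scores.
    keep_idx = key_norms.topk(num_keep, dim=-1, largest=False).indices
    # Restore the original temporal order of the retained positions.
    keep_idx, _ = keep_idx.sort(dim=-1)

    # Gather the selected keys and values along the sequence dimension.
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    compressed_keys = keys.gather(dim=2, index=gather_idx)
    compressed_values = values.gather(dim=2, index=gather_idx)
    return compressed_keys, compressed_values
```

In use, such a routine would be applied to the cached keys and values of each attention layer before the next decoding step, so subsequent attention is computed only over the retained low-norm entries.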