Effectively Compress KV Heads for LLM


11 Jun 2024 | Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu
This paper proposes a novel method to compress key-value (KV) heads in large language models (LLMs) by leveraging the low-rank property of KV caches. The authors demonstrate that only a small number of singular values of the KV cache are needed to retain most of the context information, enabling significant compression of KV heads while maintaining performance comparable to the original models. They introduce a framework that uses singular value decomposition (SVD) to compress KV heads, allowing an efficient conversion from multi-head attention (MHA) to grouped-query attention (GQA) without sacrificing accuracy. They also propose specialized strategies for rotary position embeddings (RoPE), so the compression remains compatible with RoPE-based models.

The method can compress up to three-quarters of the KV heads, substantially reducing the memory footprint and improving inference speed. The authors further show that their approach outperforms existing compression strategies, such as direct mean-pooling of KV heads, by preserving more context information and requiring fewer training resources. Overall, the results demonstrate that the method achieves accuracy comparable to the original models while greatly reducing KV cache size, making it a promising solution for more efficient LLM deployment in resource-constrained environments.
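The core idea can be illustrated with a short, self-contained sketch. This is not the authors' released code: the synthetic key cache, the group size of four heads, and the head dimension of 64 are illustrative assumptions. The sketch shows (1) how few singular values of a (simulated) key cache are needed to retain most of its energy, and (2) how truncated SVD can merge a group of key heads into a single shared head of the original head width, the basic operation behind an MHA-to-GQA style conversion.

```python
import torch

torch.manual_seed(0)

num_tokens, group_size, head_dim = 512, 4, 64
width = group_size * head_dim

# Synthetic key cache for one group of heads, built to be approximately low-rank
# (a rank-32 signal plus small noise). A real cache would come from running the
# model over calibration text.
signal = torch.randn(num_tokens, 32) @ torch.randn(32, width)
K = signal + 0.05 * torch.randn(num_tokens, width)   # [tokens, group_size * head_dim]

# (1) Low-rank check: how many singular values retain 90% of the cache's energy?
U, S, Vh = torch.linalg.svd(K, full_matrices=False)
energy = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
rank_90 = int(torch.searchsorted(energy, torch.tensor(0.90))) + 1
print(f"singular values needed for 90% energy: {rank_90} / {S.numel()}")

# (2) Merge the whole group into a single shared key head of width head_dim by
# keeping the top head_dim right singular directions (a 4x reduction in KV heads).
V_top = Vh[:head_dim].T                  # [width, head_dim]
K_shared = K @ V_top                     # compressed cache: [tokens, head_dim]
K_recon = K_shared @ V_top.T             # reconstruction of the original group

rel_err = torch.linalg.norm(K - K_recon) / torch.linalg.norm(K)
print(f"relative error after 4x head compression: {rel_err:.4f}")
```

To keep attention scores consistent after such a merge, the per-head recovery factor (here `V_top.T`) would typically be folded into the query and output projections; whether the paper does exactly this is not stated in this summary, and RoPE complicates the folding, which is why the authors describe dedicated strategies for RoPE-based models.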