PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

16 Jun 2024 | Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao
PyramidKV is a novel KV cache compression method that dynamically adjusts the KV cache size across the layers of large language models (LLMs), allocating more cache to lower layers and less to higher ones. This design is motivated by the observation that LLMs aggregate information through Pyramidal Information Funneling: attention is widely scattered in lower layers, progressively consolidates within specific contexts, and ultimately focuses on a few critical tokens in higher layers. PyramidKV significantly reduces memory usage while maintaining performance, achieving up to a 20.5-point absolute accuracy improvement on TREC while retaining only 0.7% of the KV cache. Experimental evaluations on LongBench show that PyramidKV outperforms other KV cache compression techniques across a range of cache sizes, particularly in memory-constrained scenarios. The method preserves long-context understanding with minimal performance trade-offs while substantially reducing memory consumption. Because PyramidKV aligns its per-layer budgets with the increasing attention sparsity observed in multi-layer Transformers, it is a promising approach for efficient LLM inference.
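To make the layer-wise allocation concrete, the sketch below shows one way the idea could be realized: per-layer budgets that taper linearly from the bottom to the top of the model while keeping the same total as a uniform allocation, and a per-layer eviction step that keeps the most recent tokens plus the past tokens receiving the most attention from them. This is a minimal illustration, not the authors' reference implementation; the `min_ratio` and `window` parameters and the helper names are assumptions introduced here.

```python
import torch

def pyramidal_budgets(avg_budget: int, num_layers: int, min_ratio: float = 0.2):
    """Per-layer KV cache budgets that decrease linearly from the bottom
    layer to the top layer, keeping the same total as a uniform allocation
    of `avg_budget` tokens per layer. `min_ratio` (illustrative parameter)
    is the ratio between the top layer's and the bottom layer's budget."""
    b_bottom = 2 * avg_budget / (1 + min_ratio)   # largest budget, layer 0
    b_top = min_ratio * b_bottom                  # smallest budget, last layer
    return torch.linspace(b_bottom, b_top, num_layers).round().long().tolist()

def compress_layer_kv(keys, values, attn_weights, budget: int, window: int = 8):
    """Spend one layer's budget: keep the last `window` tokens plus the past
    tokens that receive the most attention from those `window` queries
    (a SnapKV-style selection, used here only to illustrate the mechanism).

    keys, values:  [batch, heads, seq_len, head_dim]
    attn_weights:  [batch, heads, q_len, seq_len] attention of the last
                   `window` query positions over the full prefix.
    """
    seq_len = keys.shape[2]
    if seq_len <= budget:
        return keys, values                       # nothing to evict

    # Accumulate the attention mass each past token receives from the window.
    scores = attn_weights[..., -window:, : seq_len - window].sum(dim=(1, 2))
    keep_past = scores.topk(budget - window, dim=-1).indices.sort(dim=-1).values

    # Always keep the most recent `window` tokens.
    recent = torch.arange(seq_len - window, seq_len, device=keys.device)
    recent = recent.unsqueeze(0).expand(keep_past.shape[0], -1)
    keep = torch.cat([keep_past, recent], dim=-1)         # [batch, budget]

    idx = keep[:, None, :, None].expand(-1, keys.shape[1], -1, keys.shape[3])
    return keys.gather(2, idx), values.gather(2, idx)
```

For example, a 32-layer model with an average budget of 128 tokens per layer would get roughly 213 cached tokens at layer 0 tapering to about 43 at layer 31 under this linear schedule, matching the paper's intuition that lower layers need broad context while higher layers attend to only a few critical tokens.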