PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

16 Jun 2024 | Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao
The paper "PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling" explores the attention patterns in large language models (LLMs) when processing long-context inputs. The authors observe that LLMs aggregate information through a "Pyramidal Information Funneling" pattern, where attention is initially scattered widely in lower layers, gradually consolidates within specific contexts, and focuses on critical tokens in higher layers. This insight leads to the development of PyramidKV, a novel KV cache compression method that dynamically adjusts the cache size across layers, allocating more cache to lower layers and less to higher ones. PyramidKV is designed to align with the increasing attention sparsity observed in multi-layer Transformers.

Experimental evaluations on the LongBench benchmark show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, significantly reducing memory usage. In memory-constrained scenarios where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5-point absolute accuracy improvement on TREC.

The paper also includes a detailed analysis of attention patterns in multi-document question answering tasks, identifying a transition from broad coverage in lower layers to narrow focus in higher layers. This analysis provides insight into how information is aggregated and processed in LLMs. The proposed PyramidKV method is evaluated on a range of tasks and datasets, demonstrating superior performance and memory efficiency compared to existing methods.
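To make the core idea concrete, the sketch below shows one way a pyramid-shaped, per-layer KV cache budget could be implemented. This is not the paper's exact algorithm: the function names, the linear budget schedule, and the attention-score-based token selection are illustrative assumptions consistent with the description above (larger budgets in lower layers, smaller in higher layers, keeping the tokens that receive the most attention).

```python
import torch


def pyramidal_budget_schedule(num_layers: int, total_budget: int,
                              min_budget: int = 4) -> list[int]:
    """Split a total KV-cache token budget across layers so lower layers
    keep more entries than higher layers (a pyramid-shaped schedule).

    Assumption: a simple linear decrease whose sum is approximately
    `total_budget`; the paper's actual schedule may differ.
    """
    avg = total_budget / num_layers
    # Linearly spaced per-layer budgets from large (layer 0) down to min_budget.
    raw = torch.linspace(2 * avg - min_budget, min_budget, steps=num_layers)
    return raw.round().long().clamp(min=min_budget).tolist()


def compress_layer_kv(keys: torch.Tensor, values: torch.Tensor,
                      attn_scores: torch.Tensor, budget: int):
    """Keep only the `budget` KV entries with the highest accumulated attention.

    keys / values: (seq_len, head_dim) cached tensors for one layer/head.
    attn_scores:   (seq_len,) attention mass each past token received
                   from recent queries (an assumed importance signal).
    """
    if keys.size(0) <= budget:
        return keys, values
    top_idx = attn_scores.topk(budget).indices.sort().values  # keep original order
    return keys[top_idx], values[top_idx]


if __name__ == "__main__":
    budgets = pyramidal_budget_schedule(num_layers=32, total_budget=2048)
    print(budgets[0], budgets[-1])  # largest budget at layer 0, smallest at the top

    seq_len, head_dim = 512, 64
    k, v = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
    scores = torch.rand(seq_len)
    k_small, v_small = compress_layer_kv(k, v, scores, budgets[-1])
    print(k_small.shape)  # (budgets[-1], head_dim)
```

Under this kind of schedule, the total memory equals the fixed overall budget, but higher layers, where attention is sparsest, give up most of their cache while lower layers retain broad context.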