PyramidKV is a novel KV cache compression method that dynamically adjusts the KV cache size across the layers of large language models (LLMs), allocating more cache to lower layers and less to higher ones. This approach is inspired by the observation that LLMs aggregate information through Pyramidal Information Funneling: attention is widely scattered in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens in higher layers. PyramidKV significantly reduces memory usage while maintaining performance, achieving up to a 20.5-point absolute accuracy improvement on TREC while retaining only 0.7% of the KV cache. Experimental evaluations on LongBench show that PyramidKV outperforms other KV cache compression techniques across a range of cache sizes, particularly in memory-constrained scenarios. The method preserves long-context understanding while reducing memory consumption with minimal performance trade-offs. Because PyramidKV is designed to match the increasing attention sparsity observed across the layers of multi-layer Transformers, it is a promising approach for efficient LLM inference.
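To make the layer-wise allocation idea concrete, the sketch below shows one simple way such a pyramidal budget could be computed and applied. It assumes a linear decay of per-layer budgets and a SnapKV-style selection of high-attention tokens from a recent observation window; the names `pyramidal_budgets` and `compress_layer_kv` are illustrative and do not correspond to PyramidKV's actual implementation or its exact allocation schedule.

```python
import torch


def pyramidal_budgets(total_budget: int, num_layers: int, min_budget: int = 4) -> list:
    """Split a total KV-cache budget across layers so lower layers keep more
    entries than higher layers (a simple linear "pyramid").

    NOTE: illustrative allocation only; PyramidKV's exact schedule may differ.
    """
    avg = total_budget / num_layers
    top = max(min_budget, int(2 * avg - min_budget))  # budget for layer 0
    denom = max(num_layers - 1, 1)
    budgets = [
        round(top + (min_budget - top) * layer / denom)
        for layer in range(num_layers)
    ]
    # Absorb rounding error in the bottom layer so budgets sum to total_budget.
    budgets[0] += total_budget - sum(budgets)
    return budgets


def compress_layer_kv(keys, values, attn_scores, budget, window=8):
    """Keep the most recent `window` tokens plus the earlier tokens that
    received the most attention from that window, up to `budget` entries.

    keys / values: (seq_len, head_dim); attn_scores: (window, seq_len).
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    # Score earlier tokens by total attention mass from the observation window.
    scores = attn_scores[:, : seq_len - window].sum(dim=0)
    topk = torch.topk(scores, k=budget - window).indices.sort().values
    keep = torch.cat([topk, torch.arange(seq_len - window, seq_len)])
    return keys[keep], values[keep]


if __name__ == "__main__":
    # e.g. a 32-layer model with a total budget of 1024 cached tokens:
    print(pyramidal_budgets(total_budget=1024, num_layers=32))
```

Under this sketch, the bottom layers receive roughly twice the average budget while the top layers keep only a handful of entries, mirroring the pyramid-shaped attention pattern described above; the actual per-layer schedule and token-selection details used by PyramidKV are specified in the paper.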