PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

5 Jun 2024 | Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao
PyramidInfer is a method for compressing the KV cache in high-throughput LLM inference by retaining crucial context layer by layer. It reduces GPU memory usage and improves throughput by computing fewer keys and values without sacrificing performance. The method builds on two observations: the number of crucial keys and values that influence future generations decreases layer by layer, and attention weights over recent tokens are consistent. PyramidInfer compresses the KV cache during both the prefill and generation phases, achieving a 2.2x throughput improvement over Accelerate with over 54% GPU memory reduction in the KV cache.

Concretely, PyramidInfer reduces the KV cache by selecting pivotal contexts (PvCs) based on attention weights, with deeper layers retaining shorter PvCs. It is effective across a range of tasks and models, including language modeling, benchmarks, conversation, and long-context tasks, and it is orthogonal to non-KV-compression techniques such as DeepSpeed, whose efficiency it can further enhance. Experiments show that it maintains generation quality while reducing memory usage. Its limitations include limited speedup at small batch sizes and the need for further research on prefill-phase KV cache compression.
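To make the PvC-selection idea concrete, here is a minimal sketch (not the authors' implementation) of how one layer might keep only the key/value positions that recent tokens attend to most. The function name `select_pvc`, the tensor shapes, and the per-layer `ratio` schedule in the trailing comment are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of PyramidInfer-style pivotal-context (PvC) selection for one layer.
# Assumption: attention weights from a window of recent query tokens are available
# with shape (batch, heads, num_recent_queries, seq_len).
import torch

def select_pvc(attn_weights: torch.Tensor,
               keys: torch.Tensor,
               values: torch.Tensor,
               ratio: float) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only the context positions that recent tokens attend to most.

    attn_weights: (batch, heads, num_recent_queries, seq_len)
    keys, values: (batch, heads, seq_len, head_dim)
    ratio: fraction of context positions retained for this layer (assumed schedule).
    """
    # Average attention that the recent queries pay to each context position.
    scores = attn_weights.mean(dim=(1, 2))                    # (batch, seq_len)
    seq_len = scores.size(-1)
    keep = max(1, int(seq_len * ratio))
    # Indices of the top-`keep` pivotal context positions per sequence.
    idx = scores.topk(keep, dim=-1).indices                   # (batch, keep)
    idx = idx.sort(dim=-1).values                             # preserve original order
    # Gather the corresponding keys/values, broadcasting over heads and head_dim.
    gather_idx = idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(-1))
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)

# Deeper layers keep shorter PvCs; one plausible (assumed) schedule is a
# linearly decaying retention ratio per layer:
# ratios = [1.0 - 0.02 * layer_idx for layer_idx in range(num_layers)]
```

Calling `select_pvc` with a smaller `ratio` at deeper layers yields the pyramid-shaped KV cache described above, since each successive layer stores fewer keys and values for the same input.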