PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference


5 Jun 2024 | Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao
**Abstract:** Large Language Models (LLMs) face significant GPU memory pressure during inference, which limits their scalability for real-time applications such as chatbots. To address this, the authors propose PyramidInfer, a method that compresses the Key-Value (KV) cache by retaining only the crucial context layer by layer. Existing methods prune the already-computed KV cache but neglect inter-layer dependencies and the high memory cost of computing that cache in the first place. PyramidInfer observes that the number of crucial keys and values decreases with layer depth and that they can be identified from the consistency of attention weights. The method significantly reduces memory usage without sacrificing performance, achieving 2.2x higher throughput and a 54% reduction in KV cache memory compared with existing methods.

**Introduction:** LLMs such as GPT-4 have demonstrated remarkable comprehension abilities but struggle with GPU memory usage during inference. The KV cache, which stores previously computed keys and values, is a major contributor to this cost. PyramidInfer reduces the KV cache by selecting crucial context layer-wise, addressing the limitations of existing methods that only compress the pre-computed cache and neglect the high memory consumption of building the initial cache.

**Observations and Insights:**
- **Inference Context Redundancy (ICR):** the hypothesis that many keys and values are redundant for inference, with the redundancy following a power-law distribution across layers.
- **Recent Attention Consistency (RAC):** the observation that recent tokens close to the last token produce consistent attention weights, i.e., they agree on which earlier context is crucial.

**PyramidInfer:**
- **Layer-wise PvC selection:** PyramidInfer selects the crucial keys and values (Pivotal Context, PvCs) layer by layer, retaining fewer PvCs in deeper layers so that the cache forms a "pyramid" (see the sketch below).
- **Prefill phase:** only the PvCs are kept in the initial KV cache, significantly reducing memory usage.
- **Generation phase:** the PvCs are updated as new tokens are generated, maintaining efficiency.
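The layer-wise selection can be illustrated with a short PyTorch-style sketch. This is not the authors' released code: the function `compress_layer_kv` and the parameters `recent_ratio` and `min_keep_ratio` are illustrative assumptions. The sketch scores earlier tokens by the average attention they receive from the recent window (RAC) and keeps a fraction of them that shrinks with layer depth (ICR), which is what produces the pyramid-shaped cache.

```python
# Minimal sketch (assumed names, not the paper's API): layer-wise selection of
# Pivotal Context (PvC) keys/values guided by the attention of recent tokens.
import torch


def compress_layer_kv(keys, values, attn_weights, layer_idx, num_layers,
                      recent_ratio=0.3, min_keep_ratio=0.3):
    """Keep only the keys/values that recent tokens attend to most.

    keys, values:  [batch, heads, seq_len, head_dim]
    attn_weights:  [batch, heads, seq_len, seq_len] (post-softmax)
    Deeper layers keep fewer entries, forming the "pyramid".
    """
    seq_len = keys.size(2)
    num_recent = max(1, int(seq_len * recent_ratio))

    # Recent Attention Consistency: recent queries agree on which earlier
    # tokens matter, so average their attention rows as an importance score.
    recent_rows = attn_weights[:, :, -num_recent:, :]          # [B, H, R, S]
    importance = recent_rows.mean(dim=(1, 2))                  # [B, S]

    # Inference Context Redundancy: redundancy grows with depth, so the
    # retained fraction shrinks linearly toward min_keep_ratio (assumed schedule).
    depth = layer_idx / max(1, num_layers - 1)
    keep_ratio = 1.0 - depth * (1.0 - min_keep_ratio)
    num_keep = max(num_recent, int(seq_len * keep_ratio))

    # Always keep the recent window itself; fill the rest with the
    # top-scored earlier tokens (the PvCs for this layer).
    earlier = importance[:, :-num_recent]
    top_idx = earlier.topk(num_keep - num_recent, dim=-1).indices
    recent_idx = torch.arange(seq_len - num_recent, seq_len,
                              device=keys.device).expand(keys.size(0), -1)
    keep_idx = torch.cat([top_idx, recent_idx], dim=-1).sort(dim=-1).values

    gather = keep_idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(3))
    return keys.gather(2, gather), values.gather(2, gather)
```

In this sketch the compression would be applied per layer during prefill, so the initial cache stored for layer `l` already has the reduced length; during generation the same scoring can be reused to refresh which entries stay in the cache.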
**Evaluation:**
- **Benchmark results:** PyramidInfer outperforms both full-cache inference and prior KV cache compression methods, achieving higher throughput and lower GPU memory usage.
- **Ablation study:** further experiments validate the effectiveness of layer-wise PvC selection and the choice of the recent sequence ratio.

**Conclusion:** PyramidInfer is a promising solution for deploying LLMs in resource-constrained environments, significantly reducing GPU memory usage and improving inference efficiency.