The paper "MiniCache: KV Cache Compression in Depth Dimension for Large Language Models" introduces a novel approach called MiniCache to compress Key-Value (KV) caches in large language models (LLMs). The KV cache stores key-value states of previously generated tokens, reducing the need for repetitive computations and lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. MiniCache addresses this issue by compressing the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference.
Key contributions of MiniCache include:
1. **Cross-Layer Compression**: MiniCache exploits the high similarity between KV cache states in adjacent layers, particularly in the middle-to-deep portions of LLMs. It merges these states into a single shared memory space, starting from the middle layer (items 1-3 are sketched in code after this list).
2. **Reparameterization and Interpolation**: The method disentangles the state vectors into magnitude and direction components, allowing for effective interpolation of the directional components while preserving the original state norms.
3. **Token Retention Strategy**: A token retention strategy is proposed to keep highly distinct state pairs unmerged, ensuring minimal performance degradation.
4. **Memory Efficiency**: The framework is training-free and general, complementing existing KV-cache compression strategies such as quantization and sparsity.
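The following sketch shows how items 1-3 can fit together. It is a hand-written illustration of the description above, not the authors' code: the tensor shapes, the interpolation weight, the use of spherical linear interpolation (SLERP) for the directional component, and the retention threshold are all assumptions of this sketch.

```python
# Illustrative sketch of cross-layer KV merging with magnitude/direction
# reparameterization and token retention. Shapes, weights, and thresholds
# are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def merge_adjacent_layers(x_l: torch.Tensor,    # [tokens, dim] KV states, layer l
                          x_l1: torch.Tensor,   # [tokens, dim] KV states, layer l+1
                          t: float = 0.5,       # interpolation weight
                          retain_thresh: float = 0.7):
    # Disentangle each state into magnitude (norm) and direction (unit vector).
    norm_l = x_l.norm(dim=-1, keepdim=True)
    norm_l1 = x_l1.norm(dim=-1, keepdim=True)
    d_l, d_l1 = x_l / norm_l, x_l1 / norm_l1

    # Spherical linear interpolation (SLERP) of the directional components.
    cos = (d_l * d_l1).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    omega = torch.acos(cos)
    shared_dir = (torch.sin((1 - t) * omega) * d_l +
                  torch.sin(t * omega) * d_l1) / torch.sin(omega)
    shared_dir = F.normalize(shared_dir, dim=-1)

    # Tokens whose two states point in very different directions are kept
    # unmerged to avoid degrading quality.
    retain_mask = cos.squeeze(-1) < retain_thresh  # [tokens] bool

    return shared_dir, (norm_l, norm_l1), retain_mask

def restore_layer(shared_dir, norm, original, retain_mask):
    # Rescale the shared direction by this layer's own norms; overwrite the
    # retained (unmerged) tokens with their original states.
    out = shared_dir * norm
    out[retain_mask] = original[retain_mask]
    return out

# Hypothetical usage for the key states of layers l and l+1:
#   shared, (n_l, n_l1), mask = merge_adjacent_layers(k_l, k_l1)
#   k_l_restored = restore_layer(shared, n_l, k_l, mask)
```

Only the shared directions, the two sets of per-token norms, and the retained tokens need to be stored for a pair of layers, which is where the memory saving comes from.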
Experiments on various models, including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral, show that MiniCache achieves strong compression ratios and high inference throughput. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache reaches a compression ratio of up to 5.02×, improves inference throughput by approximately 5×, and reduces the memory footprint by 41% compared to the FP16 full-cache baseline, while maintaining near-lossless performance.
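A cache-only compression ratio above 5× and an end-to-end memory reduction of 41% are not contradictory, since the KV cache is only one component of inference memory. The back-of-envelope calculation below uses purely hypothetical sizes (not figures from the paper) to show how the two kinds of numbers can relate.

```python
# Back-of-envelope illustration with hypothetical sizes (not from the paper):
# a large cache-only compression ratio translates into a smaller end-to-end
# reduction because model weights and activations are untouched.

weights_gb = 13.0        # hypothetical FP16 weight footprint of a 7B model
full_cache_gb = 12.0     # hypothetical FP16 KV cache for a long-context batch
cache_compression = 5.0  # hypothetical cache-only ratio (quantization + merging)

baseline_total = weights_gb + full_cache_gb
compressed_total = weights_gb + full_cache_gb / cache_compression

print(f"cache-only compression: {cache_compression:.2f}x")
print(f"end-to-end memory reduction: {1 - compressed_total / baseline_total:.0%}")
# -> roughly a 38% end-to-end reduction despite 5x cache-only compression
```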
The paper also discusses related work, including efficient inference techniques for LLMs and model merging methods, and provides a detailed analysis of the proposed method's effectiveness and efficiency.