MiniCache is a novel approach for compressing Key-Value (KV) caches in large language models (LLMs) by leveraging the high similarity between adjacent layers in the middle-to-deep portions of the model. This method reduces the memory footprint of LLM inference while maintaining near-lossless performance. The core idea is to merge KV cache states across layers by decomposing them into magnitude and direction components, allowing for efficient interpolation of directional information while preserving the original state norms. Additionally, a token retention strategy is introduced to ensure that highly distinct state pairs are not merged, preserving critical information with minimal additional storage overhead. MiniCache is training-free and general, complementing existing compression strategies such as quantization and sparsity.
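To make the merge step concrete, the following is a minimal PyTorch sketch of cross-layer merging by magnitude-direction decomposition. It assumes spherical linear interpolation (SLERP) as the directional interpolation and a simple angular-distance threshold for deciding which token pairs to retain unmerged; the function names, the interpolation weight, and the threshold value are illustrative assumptions, not details taken from the description above.

```python
import torch

def slerp(d1: torch.Tensor, d2: torch.Tensor, t: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Spherical linear interpolation between unit vectors along the last dim (assumed scheme)."""
    cos = (d1 * d2).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos)
    sin = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin) * d1 + (torch.sin(t * omega) / sin) * d2

def merge_adjacent_layers(kv_a: torch.Tensor, kv_b: torch.Tensor, angle_thresh: float = 0.5):
    """
    Merge per-token KV states from two adjacent layers.

    kv_a, kv_b: [num_tokens, head_dim] states from layers l-1 and l.
    Returns one shared direction, the per-layer magnitudes, and a boolean mask
    of tokens that are too dissimilar to merge and must be retained as-is.
    """
    mag_a = kv_a.norm(dim=-1, keepdim=True)      # original norms are preserved ...
    mag_b = kv_b.norm(dim=-1, keepdim=True)
    dir_a = kv_a / (mag_a + 1e-6)                # ... only the directions get merged
    dir_b = kv_b / (mag_b + 1e-6)

    shared_dir = slerp(dir_a, dir_b)             # single direction stored for both layers

    # Token retention: pairs whose directions disagree strongly are kept unmerged.
    angle = torch.acos((dir_a * dir_b).sum(dim=-1).clamp(-1 + 1e-6, 1 - 1e-6))
    retain_mask = angle > angle_thresh

    return shared_dir, (mag_a, mag_b), retain_mask

def restore_layer(shared_dir, mag, kv_orig, retain_mask):
    """Reconstruct one layer's states: rescale the shared direction by that layer's
    magnitudes, falling back to the original states for retained tokens."""
    approx = shared_dir * mag
    return torch.where(retain_mask.unsqueeze(-1), kv_orig, approx)
```

In this sketch, the shared direction, the two magnitude vectors, and the retained tokens' original states would replace the two full layer caches, and `restore_layer` shows the corresponding reconstruction at decode time.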
Extensive experiments on LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral across multiple benchmarks demonstrate that MiniCache maintains near-lossless performance. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache achieves a compression ratio of up to 5.02×, increases inference throughput by approximately 5×, and reduces the memory footprint by 41% compared to the FP16 full-cache baseline.
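As a rough, back-of-envelope consistency check (an assumption, not a decomposition given above), one can see how quantization and cross-layer merging might compose: 4-bit quantization of an FP16 cache alone yields 4×, and merging adjacent layer pairs in roughly the deeper half of a 32-layer model would cut the number of stored per-layer entries by about a quarter. The layer fraction and the neglect of retention and metadata overhead are assumptions.

```python
# Back-of-envelope estimate under assumed settings (illustrative only).
bits_fp16, bits_quant = 16, 4
quant_ratio = bits_fp16 / bits_quant        # 4.0x from 4-bit quantization alone

num_layers = 32                             # e.g. a LLaMA-2-7B-sized model
merged_pairs = (num_layers // 2) // 2       # deeper half of the layers merged pairwise
stored_layers = num_layers - merged_pairs   # 32 - 8 = 24 per-layer cache entries
merge_ratio = num_layers / stored_layers    # ~1.33x from cross-layer merging

print(f"{quant_ratio * merge_ratio:.2f}x")  # ~5.33x before retention/metadata overhead
```

The reported 5.02× falls in the same ballpark, with retention and metadata overhead plausibly accounting for the gap.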
The method is efficient and memory-friendly, requiring storage for only a single high-dimensional directional component, along with minimal extra memory overhead. It is particularly effective for large LLMs, demonstrating significant performance improvements in both memory efficiency and throughput. MiniCache's approach of cross-layer merging and token retention provides a promising solution for efficient LLM inference, offering a state-of-the-art balance between efficiency and performance.
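The storage claim above can be illustrated with a small container type: each merged layer pair keeps one high-dimensional direction tensor plus low-dimensional per-layer magnitudes and the few retained, unmerged token states. The layout and field names below are a hypothetical sketch, not the described implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class MergedKVEntry:
    """Cache entry for one merged pair of adjacent layers (illustrative layout)."""
    shared_dir: torch.Tensor     # [num_tokens, head_dim] single shared direction
    mag_low: torch.Tensor        # [num_tokens, 1] magnitudes of the shallower layer
    mag_high: torch.Tensor       # [num_tokens, 1] magnitudes of the deeper layer
    retained_idx: torch.Tensor   # [num_retained] indices of unmerged tokens
    retained_low: torch.Tensor   # [num_retained, head_dim] original states, layer l-1
    retained_high: torch.Tensor  # [num_retained, head_dim] original states, layer l

    def extra_elements(self) -> int:
        """Tensor elements stored beyond the single shared direction tensor,
        illustrating why the overhead stays small when few tokens are retained."""
        return (self.mag_low.numel() + self.mag_high.numel()
                + self.retained_idx.numel()
                + self.retained_low.numel() + self.retained_high.numel())
```

Relative to keeping both layers' full caches, the dominant saving comes from storing one direction tensor instead of two; the magnitudes add only one scalar per token per layer, and the retained states matter only for the small fraction of tokens that fail the similarity check.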