17 Jun 2024 | Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
SnapKV is an approach designed to efficiently reduce the Key-Value (KV) cache size while maintaining comparable performance in real-world applications. The authors observe that attention heads in large language models (LLMs) consistently attend to specific features of the prompt during generation, and that these important features can be identified from an 'observation' window at the end of the prompt. SnapKV automatically compresses the KV cache by selecting clustered important KV positions for each attention head, significantly reducing computational overhead and memory usage when processing long input sequences.
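To make the mechanism concrete, below is a minimal PyTorch sketch of the per-head selection step under some illustrative assumptions: the tensor shapes, the `obs_window`, `kv_budget`, and `pool_kernel` parameters, and the use of max-pooling to cluster neighboring positions are choices made for this example, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def snapkv_select(keys, values, queries, obs_window=32, kv_budget=1024, pool_kernel=7):
    """Sketch of SnapKV-style KV cache compression for one layer.

    keys, values: [num_heads, seq_len, head_dim] cached prompt KV.
    queries:      [num_heads, seq_len, head_dim] prompt queries; only the last
                  `obs_window` positions are used as the observation window.
    Returns per-head compressed keys/values of length kv_budget + obs_window.
    """
    num_heads, seq_len, head_dim = keys.shape
    prefix_len = seq_len - obs_window
    if prefix_len <= kv_budget:
        return keys, values  # prompt already fits the budget; nothing to compress

    # 1. Attention of observation-window queries over the prefix keys.
    q_obs = queries[:, -obs_window:, :]                                   # [H, W, D]
    attn = torch.einsum("hwd,hsd->hws", q_obs, keys[:, :prefix_len, :])   # [H, W, S]
    attn = F.softmax(attn / head_dim ** 0.5, dim=-1)

    # 2. Per-head importance of each prefix position, aggregated over the window.
    scores = attn.sum(dim=1)                                              # [H, S]

    # 3. Pooling clusters importance so neighboring positions are kept together.
    scores = F.max_pool1d(scores.unsqueeze(1), kernel_size=pool_kernel,
                          stride=1, padding=pool_kernel // 2).squeeze(1)

    # 4. Keep the top-`kv_budget` prefix positions per head, plus the window itself.
    top_idx = scores.topk(kv_budget, dim=-1).indices.sort(dim=-1).values  # [H, kv_budget]
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    k_sel = torch.gather(keys[:, :prefix_len, :], 1, gather_idx)
    v_sel = torch.gather(values[:, :prefix_len, :], 1, gather_idx)

    k_comp = torch.cat([k_sel, keys[:, prefix_len:, :]], dim=1)
    v_comp = torch.cat([v_sel, values[:, prefix_len:, :]], dim=1)
    return k_comp, v_comp
```

Because selection is done independently per head, each head retains the prefix positions it actually attends to, which is what lets the cache shrink without retraining.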
Key contributions of SnapKV include:
- A detailed exploration of attention allocation patterns during generation.
- An efficient and fine-tuning-free algorithm that identifies critical attention features and compresses the KV cache.
- Comprehensive evaluation across diverse LLMs and long-sequence datasets, showing improved decoding speed and memory efficiency without compromising accuracy.
SnapKV demonstrates its effectiveness through various experiments, including the Needle-in-a-Haystack test, where it achieves a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency over the baseline when processing 16K-token inputs. It also delivers comparable performance across 16 long-sequence datasets and can process up to 380K context tokens on a single A100-80GB GPU with minimal accuracy loss.
The paper discusses the limitations of existing methods and highlights the need for context-aware compression strategies. SnapKV's robustness is further demonstrated through experiments on different datasets and models, including LWM-Text-Chat-1M, LongBench, and Command-R, where it consistently outperforms baseline methods in terms of speed, memory efficiency, and accuracy.