17 Jun 2024 | Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
SnapKV is an efficient, fine-tuning-free method that reduces the size of the Key-Value (KV) cache while maintaining performance in real-world applications. It selects and clusters the most important KV positions for each attention head, significantly reducing computational overhead and memory usage during generation. SnapKV achieves a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency compared to the baseline when processing 16K-token inputs. It also maintains performance comparable to the baseline across 16 long-sequence datasets and can process up to 380K tokens on a single A100-80GB GPU with a minimal accuracy drop.
SnapKV works by keeping a constant number of prompt KVs during generation, reducing serving times for long-context LLMs. It uses an observation window at the end of the prompt to score how strongly each attention head attends to earlier prompt positions, then applies pooling to cluster the highest-scoring positions so the compressed cache retains local context rather than isolated tokens. This selection is robust across different instruction contexts, maintaining high hit rates even when instructions vary, and the method also performs strongly on retrieval tasks while remaining compatible with parallel decoding frameworks, further enhancing generation efficiency. A sketch of the selection step follows.
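To make the selection step concrete, below is a minimal PyTorch sketch of SnapKV-style KV compression, assuming the standard [batch, heads, sequence, head_dim] cache layout. The function name `snapkv_select` and the `window_size`, `max_capacity`, and `kernel_size` defaults are illustrative assumptions rather than the released implementation, which hooks into each attention layer's prefill pass.

```python
import torch
import torch.nn.functional as F

def snapkv_select(query_states, key_states, value_states,
                  window_size=32, max_capacity=1024, kernel_size=7):
    """Compress the prompt KV cache by voting with the observation window.

    All inputs are [bsz, num_heads, seq_len, head_dim] prefill tensors.
    Returns compressed key/value tensors of length max_capacity.
    """
    bsz, num_heads, seq_len, head_dim = key_states.shape
    if seq_len <= max_capacity:
        return key_states, value_states  # nothing to compress

    # 1. Attention of the last `window_size` queries (the observation
    #    window) over the earlier prompt positions.
    obs_q = query_states[:, :, -window_size:, :]
    prefix_k = key_states[:, :, :-window_size, :]
    attn = torch.matmul(obs_q, prefix_k.transpose(-1, -2)) / head_dim ** 0.5
    attn = attn.softmax(dim=-1)

    # 2. Per-head voting: sum each position's attention scores across
    #    the observation window.
    votes = attn.sum(dim=-2)  # [bsz, num_heads, seq_len - window_size]

    # 3. Pooling clusters neighboring positions so selected entries keep
    #    local context instead of isolated tokens.
    votes = F.max_pool1d(votes, kernel_size=kernel_size, stride=1,
                         padding=kernel_size // 2)

    # 4. Keep the top-scoring prefix positions per head, plus the
    #    observation window itself. Attention is permutation-invariant
    #    over cached KVs once positions are encoded into the keys, so
    #    gathering in top-k order is safe.
    idx = votes.topk(max_capacity - window_size, dim=-1).indices
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    k_sel = prefix_k.gather(dim=2, index=idx)
    v_sel = value_states[:, :, :-window_size, :].gather(dim=2, index=idx)

    keys = torch.cat([k_sel, key_states[:, :, -window_size:, :]], dim=2)
    values = torch.cat([v_sel, value_states[:, :, -window_size:, :]], dim=2)
    return keys, values
```

A generation loop would call this once at the end of prefill and then decode against the compressed cache, which is what keeps the number of stored prompt KVs constant regardless of input length.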
SnapKV was evaluated on multiple datasets, including LongBench, and showed significant improvements in memory efficiency and generation speed. It outperformed H2O in several benchmarks and maintained high accuracy even with compressed KV caches. SnapKV is also effective in retrieval-augmented generation (RAG) tasks, demonstrating robustness in handling long sequences and maintaining performance in various scenarios. The method is compatible with parallel decoding, further enhancing LLM efficiency in long-context scenarios. Overall, SnapKV provides an effective solution for reducing the computational and memory burdens of processing extensive prompts while maintaining performance.