Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

21 May 2024 | William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley
This paper introduces Cross-Layer Attention (CLA), a method for reducing the memory footprint of the key-value (KV) cache in transformer-based large language models (LLMs). CLA shares key and value activations across adjacent layers, shrinking the KV cache by a further 2× beyond Multi-Query Attention (MQA) while maintaining nearly the same accuracy. Because CLA is orthogonal to MQA and Grouped-Query Attention (GQA), it can be combined with either, yielding a Pareto improvement in the memory/accuracy tradeoff and enabling longer sequence lengths and larger batch sizes during inference. Extensive experiments on 1B- and 3B-parameter models show that CLA achieves significant memory savings with minimal accuracy degradation, that its benefits hold across model scales and configurations, and that combining CLA with MQA gives the best accuracy/memory tradeoffs. The paper also reviews related work on KV-cache compression and architectural changes that reduce KV-cache size, and concludes that CLA is an effective way to improve the efficiency of transformer-based LLMs.
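To make the sharing pattern concrete, below is a minimal PyTorch sketch (not the authors' code) of a CLA-style attention layer in the paper's best-performing flavor, MQA combined with a sharing factor of 2 (CLA2): layers constructed with owns_kv=True project a single MQA key/value head, while the next layer reuses that pair instead of computing its own. The class and argument names (CLAAttentionBlock, owns_kv, shared_kv) are illustrative assumptions.

```python
# Minimal sketch of Cross-Layer Attention (CLA2) combined with MQA.
# Illustrative only; names and structure are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAAttentionBlock(nn.Module):
    """One attention layer. Layers with owns_kv=True compute (and would cache)
    a single MQA key/value head; layers with owns_kv=False reuse the KV pair
    produced by the preceding layer, halving the KV cache (CLA2)."""

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # MQA: one shared KV head instead of n_heads KV heads.
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head, bias=False) if owns_kv else None
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.kv_proj is not None:
            k, v = self.kv_proj(x).split(self.d_head, dim=-1)
            k = k.view(B, T, 1, self.d_head).transpose(1, 2)
            v = v.view(B, T, 1, self.d_head).transpose(1, 2)
            shared_kv = (k, v)  # this is what would live in the KV cache
        else:
            k, v = shared_kv    # CLA: reuse the previous layer's keys/values
        # Broadcast the single KV head across all query heads (MQA).
        k = k.expand(B, self.n_heads, T, self.d_head)
        v = v.expand(B, self.n_heads, T, self.d_head)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv

# Usage: pairs of adjacent layers share one KV projection (sharing factor 2).
layers = nn.ModuleList(
    [CLAAttentionBlock(d_model=512, n_heads=8, owns_kv=(i % 2 == 0)) for i in range(4)]
)
x, shared_kv = torch.randn(2, 16, 512), None
for layer in layers:
    x, shared_kv = layer(x, shared_kv)
```

In an actual decoder, only the layers that own a KV projection would write to the KV cache, which is where the additional 2× saving over plain MQA comes from: MQA already reduces the cache from one KV head per query head to a single head per layer, and CLA2 stores that head for only half of the layers.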