21 May 2024 | William Brandon*, Mayank Mishra*, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley
The paper introduces Cross-Layer Attention (CLA), a novel method for reducing the size of the Key-Value (KV) cache in transformer-based autoregressive large language models (LLMs). CLA shares key and value heads across adjacent layers, further reducing the number of distinct key/value heads beyond what Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) achieve. The authors demonstrate that CLA can shrink the KV cache by a further 2× while maintaining nearly the same accuracy as unmodified MQA. Experiments on 1B- and 3B-parameter models show that CLA provides a Pareto improvement over the memory/accuracy trade-offs possible with traditional MQA, enabling longer sequence lengths and larger batch sizes. The paper also explores different CLA configurations and their effects on accuracy and memory usage, finding a sharing factor of 2 to be the most effective. CLA is shown to be compatible with standard tensor parallelism techniques and has only minor effects on the other resources the model consumes during training and inference. The authors conclude that CLA is an effective method for reducing the KV cache memory footprint, advancing the Pareto frontier for memory-efficient transformers.
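To make the sharing pattern concrete, here is a minimal PyTorch sketch of CLA with a sharing factor of 2 on top of an MQA base. The class `CLAAttention`, its parameter names, and the layer sizes are hypothetical illustrations, not the authors' implementation: even-indexed layers project and cache a single K/V head, while the adjacent odd-indexed layers reuse that K/V instead of computing (and caching) their own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAAttention(nn.Module):
    """Hypothetical attention block: MQA queries plus optional cross-layer KV sharing."""
    def __init__(self, d_model: int, n_heads: int, computes_kv: bool):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # MQA: one shared K head and one V head, present only on "producer" layers.
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head, bias=False) if computes_kv else None

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        if self.kv_proj is not None:
            # This layer owns a KV cache entry: project a single K/V head and pass it upward.
            k, v = self.kv_proj(x).chunk(2, dim=-1)
            k = k.view(b, t, 1, self.d_head).transpose(1, 2)
            v = v.view(b, t, 1, self.d_head).transpose(1, 2)
            shared_kv = (k, v)
        else:
            # CLA: reuse the K/V produced by the adjacent layer below; no new cache entry.
            k, v = shared_kv
        # Broadcast the single KV head across all query heads (MQA-style).
        k = k.expand(b, self.n_heads, t, self.d_head)
        v = v.expand(b, self.n_heads, t, self.d_head)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1)), shared_kv

# Sharing factor 2 (CLA2): layers pair up, so only half of them contribute KV cache.
layers = nn.ModuleList(
    CLAAttention(d_model=512, n_heads=8, computes_kv=(i % 2 == 0)) for i in range(4)
)
x, shared_kv = torch.randn(2, 16, 512), None
for layer in layers:
    x, shared_kv = layer(x, shared_kv)  # residuals/MLP omitted; attention pattern only
```

With this wiring, the KV cache holds entries for only the producer layers, which is where the additional 2× reduction relative to plain MQA comes from; a sharing factor of 3 or more would drop even more cache entries but, per the paper's ablations, costs more accuracy.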