Layer-Condensed KV Cache for Efficient Inference of Large Language Models

4 Jun 2024 | Haoyi Wu and Kewei Tu
The paper introduces a novel method to reduce memory consumption and improve inference throughput for large language models (LLMs) by significantly reducing the number of layers whose key-value (KV) cache needs to be computed and stored. The proposed method pairs the queries of all layers with keys and values from only the top layer, so KVs for the other layers need not be cached or even computed, saving both memory and computation. Because the top layer's KVs depend on the full forward pass, naive training loses parallelism over tokens; the authors address this by designing an approximate training method that restores parallel training. Experiments on Llama-style models show that the method achieves up to 32× larger batch sizes and up to 26× higher throughput compared to standard transformers, while maintaining competitive performance in language modeling and downstream tasks. The method also integrates well with other memory-saving techniques such as StreamingLLM, further improving inference efficiency. The paper concludes with a discussion of the trade-offs and limitations of the approach, highlighting its potential for improving inference efficiency in LLMs.
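To make the core idea concrete, below is a minimal sketch of a single decoding step under such a layer-condensed cache, written in PyTorch. Everything in it (the toy sizes, the per-layer structure, the decode_step helper) is a hypothetical illustration rather than the authors' implementation; it omits the paper's warmup layers, attention masking details, and multi-head attention, and simply has every layer's query attend to one shared cache holding only the top layer's keys and values of past tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_layers, seq_past, batch = 64, 4, 16, 2  # toy sizes, not from the paper

# Hypothetical per-layer parameters: a query projection and a feed-forward block.
layers = [
    {
        "q_proj": torch.nn.Linear(d_model, d_model),
        "mlp": torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        ),
    }
    for _ in range(n_layers)
]
# A single K/V projection applied only to the top layer's output.
k_proj = torch.nn.Linear(d_model, d_model)
v_proj = torch.nn.Linear(d_model, d_model)

def decode_step(x, K_cache, V_cache):
    """One decoding step. The cache holds keys/values from the TOP layer only,
    so its memory footprint does not grow with the number of layers."""
    h = x
    for layer in layers:
        q = layer["q_proj"](h).unsqueeze(1)  # (batch, 1, d_model)
        # Every layer's query attends to the same top-layer KV cache of past tokens.
        attn = F.scaled_dot_product_attention(q, K_cache, V_cache).squeeze(1)
        h = h + attn
        h = h + layer["mlp"](h)
    # Only after the top layer are K/V computed for this token and appended to the cache.
    K_cache = torch.cat([K_cache, k_proj(h).unsqueeze(1)], dim=1)
    V_cache = torch.cat([V_cache, v_proj(h).unsqueeze(1)], dim=1)
    return h, K_cache, V_cache

# Usage: a cache of past top-layer KVs plus the current token's hidden state.
K_cache = torch.randn(batch, seq_past, d_model)
V_cache = torch.randn(batch, seq_past, d_model)
x = torch.randn(batch, d_model)
h, K_cache, V_cache = decode_step(x, K_cache, V_cache)
print(h.shape, K_cache.shape)  # (2, 64) and (2, 17, 64)
```

The point of the sketch is the cache shape: a standard transformer would keep one (K, V) pair per layer, whereas here a single shared pair serves all layers, which is what enables the much larger batch sizes reported in the paper (the actual method retains a few "warmup" layers with their own KVs, so the saving is slightly less than the full layer count).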