2024 | Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
Dynamic Memory Compression (DMC) is a method for reducing the size of the key-value (KV) cache in large language models (LLMs) to improve inference efficiency. During inference, DMC compresses the cache on the fly by deciding, for each new token, whether to append its key-value pair to the cache or merge it with the cache's top element. This lets LLMs preserve quality while reducing memory usage and increasing throughput. DMC is applied by retrofitting pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) on a negligible amount of additional data, without adding any new parameters, and achieves up to a 7× throughput increase on an NVIDIA H100 GPU. DMC outperforms Grouped Query Attention (GQA) and key-value eviction policies such as H2O and TOVA, and it can be combined with GQA for compounded gains. It preserves downstream performance at up to 4× cache compression and allows longer contexts and larger batches to fit within a given memory budget. As a drop-in replacement for standard KV caching, DMC applies across LLM sizes and enables more efficient use of hardware resources.
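To make the append-or-merge idea concrete, here is a minimal sketch of a DMC-style cache update for a single attention head. This is an illustration rather than the paper's actual implementation: the function name `dmc_cache_update`, the weighted-average merge rule, and the per-token decision `alpha_t` and importance weight `omega_t` (which in DMC would be predicted by the model) are assumptions for the sake of the example.

```python
import torch

def dmc_cache_update(keys, values, omegas, key_t, value_t, alpha_t, omega_t):
    """Append-or-merge update for one attention head's KV cache (sketch).

    keys, values : lists of cached key/value vectors
    omegas       : running importance weights of each cache slot
    alpha_t      : merge decision for the new token (1 = merge, 0 = append)
    omega_t      : importance weight assumed to be predicted for the new token
    """
    if alpha_t == 1 and keys:
        # Merge: fold the new key/value into the top cache slot using a
        # weighted running average over accumulated importance weights.
        w_old, w_new = omegas[-1], omega_t
        keys[-1] = (w_old * keys[-1] + w_new * key_t) / (w_old + w_new)
        values[-1] = (w_old * values[-1] + w_new * value_t) / (w_old + w_new)
        omegas[-1] = w_old + w_new
    else:
        # Append: grow the cache by one slot, as in standard KV caching.
        keys.append(key_t)
        values.append(value_t)
        omegas.append(omega_t)
    return keys, values, omegas


# Toy usage: two tokens, the second merged into the first, so the cache
# ends up holding one slot instead of two (2x compression in this step).
keys, values, omegas = [], [], []
k1, v1 = torch.randn(64), torch.randn(64)
k2, v2 = torch.randn(64), torch.randn(64)
keys, values, omegas = dmc_cache_update(keys, values, omegas, k1, v1, alpha_t=0, omega_t=1.0)
keys, values, omegas = dmc_cache_update(keys, values, omegas, k2, v2, alpha_t=1, omega_t=0.8)
assert len(keys) == 1
```

Under this sketch, the compression ratio emerges from how often the model chooses to merge rather than append; standard KV caching corresponds to always appending.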