2024 | Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
Dynamic Memory Compression (DMC) is a method for reducing the size of the key-value (KV) cache in large language models (LLMs) to improve inference efficiency. During inference, DMC compresses the cache on the fly by deciding, for each new token, whether to append its key-value pair to the cache or merge it with the cache's top element. This lets LLMs preserve quality while reducing memory usage and increasing throughput. DMC is applied by retrofitting pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) on a negligible amount of additional data, without adding any new parameters, and achieves up to a 7× throughput increase on an NVIDIA H100 GPU. DMC outperforms Grouped Query Attention (GQA) and key-value eviction policies such as H2O and TOVA, and it can be combined with GQA for compounded gains. It preserves downstream performance at up to 4× cache compression and allows longer contexts and larger batches to fit within a given memory budget. As a drop-in replacement for standard KV caching, DMC applies across LLM sizes and enables more efficient use of hardware resources.
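To make the append-or-merge idea concrete, here is a minimal sketch of a DMC-style cache update for a single attention head. This is an illustration rather than the paper's actual implementation: the function name `dmc_cache_update`, the weighted-average merge rule, and the per-token decision `alpha_t` and importance weight `omega_t` (which in DMC would be predicted by the model) are assumptions for the sake of the example.

```python
import torch

def dmc_cache_update(keys, values, omegas, key_t, value_t, alpha_t, omega_t):
    """Append-or-merge update for one attention head's KV cache (sketch).

    keys, values : lists of cached key/value vectors
    omegas       : running importance weights of each cache slot
    alpha_t      : merge decision for the new token (1 = merge, 0 = append)
    omega_t      : importance weight assumed to be predicted for the new token
    """
    if alpha_t == 1 and keys:
        # Merge: fold the new key/value into the top cache slot using a
        # weighted running average over accumulated importance weights.
        w_old, w_new = omegas[-1], omega_t
        keys[-1] = (w_old * keys[-1] + w_new * key_t) / (w_old + w_new)
        values[-1] = (w_old * values[-1] + w_new * value_t) / (w_old + w_new)
        omegas[-1] = w_old + w_new
    else:
        # Append: grow the cache by one slot, as in standard KV caching.
        keys.append(key_t)
        values.append(value_t)
        omegas.append(omega_t)
    return keys, values, omegas


# Toy usage: two tokens, the second merged into the first, so the cache
# ends up holding one slot instead of two (2x compression in this step).
keys, values, omegas = [], [], []
k1, v1 = torch.randn(64), torch.randn(64)
k2, v2 = torch.randn(64), torch.randn(64)
keys, values, omegas = dmc_cache_update(keys, values, omegas, k1, v1, alpha_t=0, omega_t=1.0)
keys, values, omegas = dmc_cache_update(keys, values, omegas, k2, v2, alpha_t=1, omega_t=0.8)
assert len(keys) == 1
```

Under this sketch, the compression ratio emerges from how often the model chooses to merge rather than append; standard KV caching corresponds to always appending.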