2024 | Wonbeom Lee†, Jungi Lee†, Junghwan Seo, Jaewoong Sim
InfiniGen is a novel KV cache management framework designed to work synergistically with modern offloading-based inference systems for efficient large language model (LLM) inference. The key idea is to speculate on the essential KV cache entries needed for the next attention layer by performing a minimal rehearsal with the current layer's inputs and a portion of the query weight and key cache of the subsequent layer. This allows InfiniGen to prefetch only the essential KV cache entries, reducing the fetch overhead from host memory in offloading-based systems. Evaluation on several LLMs shows that InfiniGen improves performance by up to 3.00× compared to prior methods while offering better accuracy.
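The speculation step can be pictured with the minimal PyTorch sketch below: approximate attention scores are computed from a partial query and partial key cache, and only the highest-scoring tokens are marked for prefetch. The function name, the per-head top-k selection, and the `top_ratio` parameter are illustrative assumptions, not InfiniGen's actual interface.

```python
import torch

def speculate_important_tokens(partial_query, partial_key_cache, top_ratio=0.2):
    """Return indices of KV entries predicted to matter for the next layer.

    partial_query:     (num_heads, partial_dim)           query built from a few skewed columns
    partial_key_cache: (num_heads, seq_len, partial_dim)  key cache restricted to the same columns
    """
    # Approximate the attention logits using only the partial dimensions.
    approx_scores = torch.einsum("hd,hsd->hs", partial_query, partial_key_cache)
    # Keep the tokens with the highest approximate scores per head.
    k = max(1, int(top_ratio * partial_key_cache.shape[1]))
    topk = approx_scores.topk(k, dim=-1).indices  # (num_heads, k)
    # The union across heads is the set of KV rows to prefetch from host memory.
    return torch.unique(topk)
```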
LLMs face challenges in serving long content due to the large memory footprint of the transient state, known as the key-value (KV) cache, which scales with both sequence length and batch size. Modern LLM serving systems support offloading data to CPU memory to serve LLMs within limited hardware budgets, but transferring the massive KV cache from CPU to GPU memory then becomes a performance bottleneck. InfiniGen addresses this by managing the KV cache pool in CPU memory, dynamically adjusting the number of KV entries to prefetch, and removing infrequently used tokens from the pool to alleviate memory pressure.
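As a rough illustration of the offloading setup, the sketch below keeps the KV cache pool in host memory and copies only a selected subset of rows to the GPU before attention; the function name and tensor layout are assumptions for illustration, not the serving system's API.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def prefetch_kv(cpu_key_pool, cpu_value_pool, indices):
    """Copy only the speculated KV rows from the host-side pool to the GPU.

    cpu_key_pool, cpu_value_pool: (num_tokens, head_dim) tensors kept in CPU memory.
    indices: 1-D tensor of token positions selected for this decoding step.
    """
    keys = cpu_key_pool.index_select(0, indices).to(device, non_blocking=True)
    values = cpu_value_pool.index_select(0, indices).to(device, non_blocking=True)
    return keys, values
```

Copying only the selected rows is what shrinks the per-step PCIe transfer from the full KV cache to a small fraction of it.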
InfiniGen exploits the structure of the Transformer's query and key weight matrices, skewing them so that a small number of important columns capture most of the attention score and can stand in for the full matrices during speculation. During the prefill stage, InfiniGen generates these partial weights for use in the decoding stage. At Layer i−1, it speculates on the attention pattern of the next layer using the attention input of Layer i−1, the partial query weight of Layer i, and the partial key cache of Layer i. Based on the speculated pattern, InfiniGen prefetches only the essential KV cache entries from CPU memory for the attention computation at Layer i. By dynamically adjusting the number of KV entries to prefetch, InfiniGen reduces the overhead of KV cache transfer.
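A minimal sketch of the skewing idea, assuming an SVD-based scheme: rotating the per-head query and key weight matrices by an orthogonal matrix preserves the attention scores while concentrating magnitude into a few columns, which then serve as the partial weights. Variable names and the `keep_cols` parameter are illustrative assumptions.

```python
import torch

def skew_and_select(w_query, w_key, prefill_query, keep_cols=32):
    """Skew per-head W_Q / W_K and pick the dominant columns for speculation.

    w_query, w_key: (hidden_dim, head_dim) per-head weight matrices
    prefill_query:  (prompt_len, head_dim)  queries computed during the prefill stage
    """
    # SVD of the prefill query: Q = U S Vh. Rotating by V (orthogonal) keeps
    # Q' K'^T = Q K^T while packing most of the magnitude into a few columns.
    _, _, vh = torch.linalg.svd(prefill_query, full_matrices=False)
    a = vh.T  # orthogonal skew matrix, (head_dim, head_dim) when prompt_len >= head_dim
    w_query_skewed = w_query @ a
    w_key_skewed = w_key @ a
    # Columns of the skewed query with the largest magnitude dominate the score;
    # only these columns are needed for attention-pattern speculation.
    col_norms = (prefill_query @ a).abs().mean(dim=0)
    partial_cols = col_norms.topk(min(keep_cols, col_norms.numel())).indices
    return w_query_skewed, w_key_skewed, partial_cols
```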
InfiniGen manages the KV cache pool by dynamically removing infrequently used tokens, using a counter-based eviction policy that balances accuracy and runtime overhead. Evaluation on two representative LLMs shows that InfiniGen achieves up to a 3.00× speedup over existing methods while improving accuracy by up to 32.6 percentage points. InfiniGen's performance gains continue to grow with larger models, longer sequence lengths, and larger batch sizes, whereas the speedups of prior compression-based methods saturate.
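The counter-based eviction policy can be pictured with the hedged sketch below: each cached token tracks how many consecutive decoding steps it has gone unselected, and tokens that stay cold past a threshold are dropped from the pool. The threshold value and data layout are assumptions; a real implementation would also maintain an index map so token positions stay consistent after eviction.

```python
import torch

class KVPool:
    """Host-side KV cache pool with a simple counter-based eviction policy."""

    def __init__(self, keys, values, evict_after=4):
        self.keys = keys                      # (num_tokens, head_dim), CPU memory
        self.values = values                  # (num_tokens, head_dim), CPU memory
        self.miss_counters = torch.zeros(keys.shape[0], dtype=torch.long)
        self.evict_after = evict_after

    def update(self, selected):
        """`selected`: indices of tokens prefetched in the current decoding step."""
        self.miss_counters += 1
        self.miss_counters[selected] = 0      # recently used tokens start over
        keep = self.miss_counters < self.evict_after  # drop tokens that stayed cold
        self.keys = self.keys[keep]
        self.values = self.values[keep]
        self.miss_counters = self.miss_counters[keep]
```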