28 Jun 2024 | Wonbeom Lee† Jungi Lee† Junghwan Seo Jaewoong Sim
InfiniGen is a dynamic KV cache management framework designed for long-text generation in large language models (LLMs). It works alongside modern offloading-based inference systems, which keep the KV cache in CPU memory, and manages that cache more efficiently. The key insight behind InfiniGen is that the few important tokens essential for computing the subsequent attention layer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows InfiniGen to prefetch only the necessary KV cache entries, reducing the overhead of fetching the KV cache from host memory in offloading-based LLM serving systems. The evaluation on representative LLMs shows that InfiniGen improves overall performance by up to 3.00× compared to prior KV cache management methods while offering better model accuracy. The framework dynamically adjusts the number of KV entries to prefetch and manages the KV cache pool by removing infrequently used tokens, ensuring efficient use of CPU memory.
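
The speculation step described above can be illustrated with a minimal NumPy sketch. It is not InfiniGen's actual implementation: the function name `speculate_important_tokens`, the column subset `cols`, and the threshold `alpha` are hypothetical, and the rule "keep every token whose approximate score is within `alpha` of the maximum" is an assumption used here only to show how the number of prefetched entries can vary from step to step.

```python
import numpy as np

def speculate_important_tokens(x_cur, wq_next, k_cache_next, cols, alpha):
    """Guess which cached tokens the next layer's attention will need.

    x_cur:        (d_model,) input of the current layer for the new token
    wq_next:      (d_model, d_head) query weight of the subsequent layer
    k_cache_next: (n_tokens, d_head) key cache of the subsequent layer
    cols:         column subset used for the rehearsal (hypothetical choice)
    alpha:        score margin below the maximum (assumed selection rule)
    """
    # Approximate query using only part of the next layer's query weight.
    q_approx = x_cur @ wq_next[:, cols]
    # Approximate attention scores using the matching part of the key cache.
    scores = k_cache_next[:, cols] @ q_approx
    # Keep a variable number of tokens: all those close to the top score.
    keep = np.nonzero(scores >= scores.max() - alpha)[0]
    return keep, scores

def prefetch(k_cache_cpu, v_cache_cpu, keep):
    """Gather only the selected KV entries (stand-in for a CPU-to-GPU copy)."""
    return k_cache_cpu[keep], v_cache_cpu[keep]

# Toy example: 5 cached tokens, d_model = 4, d_head = 3, rehearsal on 2 columns.
x_cur = np.array([1.0, 0.0, 2.0, 0.0])
wq_next = np.array([[1.0, 9.0, 0.0],
                    [0.0, 9.0, 1.0],
                    [0.5, 9.0, 0.0],
                    [0.0, 9.0, 0.0]])
k_cache = np.array([[3.0, 7.0, 0.0],
                    [0.0, 7.0, 5.0],
                    [2.9, 7.0, 1.0],
                    [-1.0, 7.0, 0.0],
                    [1.0, 7.0, 1.0]])
v_cache = np.arange(15, dtype=float).reshape(5, 3)

keep, scores = speculate_important_tokens(x_cur, wq_next, k_cache,
                                          cols=[0, 2], alpha=1.0)
k_sel, v_sel = prefetch(k_cache, v_cache, keep)
print(keep, k_sel.shape)
```

Because the selection is a margin around the top score rather than a fixed top-k, the number of entries fetched grows or shrinks with how concentrated the approximate attention is, which is one simple way to realize a dynamically adjusted prefetch size.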