8 Feb 2024 | Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
The paper "SubGEN: Token Generation in Sublinear Time and Memory" addresses the challenge of deploying large language models (LLMs) in long-context token generation due to their high memory requirements. The authors propose SubGEN, an efficient method for compressing the key-value (KV) cache used in autoregressive attention decoding. SubGEN leverages the observation that key embeddings tend to cluster within the attention module, employing online clustering on key tokens and $\ell_2$ sampling on values. This approach ensures sublinear memory and time complexity while maintaining accuracy. Empirical evaluations on long-context question-answering tasks demonstrate that SubGEN outperforms existing KV cache compression methods in terms of performance and efficiency. The paper also includes a detailed analysis of the algorithm's correctness and complexity, along with experimental results showing its effectiveness in reducing memory footprint and improving decoding accuracy.The paper "SubGEN: Token Generation in Sublinear Time and Memory" addresses the challenge of deploying large language models (LLMs) in long-context token generation due to their high memory requirements. The authors propose SubGEN, an efficient method for compressing the key-value (KV) cache used in autoregressive attention decoding. SubGEN leverages the observation that key embeddings tend to cluster within the attention module, employing online clustering on key tokens and $\ell_2$ sampling on values. This approach ensures sublinear memory and time complexity while maintaining accuracy. Empirical evaluations on long-context question-answering tasks demonstrate that SubGEN outperforms existing KV cache compression methods in terms of performance and efficiency. The paper also includes a detailed analysis of the algorithm's correctness and complexity, along with experimental results showing its effectiveness in reducing memory footprint and improving decoding accuracy.