SubGen: Token Generation in Sublinear Time and Memory


8 Feb 2024 | Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
SubGen is a novel algorithm for token generation in sublinear time and memory, designed to address the memory and computational challenges that large language models (LLMs) face in long-context scenarios. During autoregressive decoding, LLMs store the key and value embeddings of all previous tokens in the attention module's KV cache; this cache grows linearly with context length, making long sequences inefficient to serve. SubGen attains sublinear complexity by exploiting the clustering tendency of key embeddings, combining online clustering on keys with ℓ₂ sampling on values.

The result is a provably accurate and efficient attention decoding algorithm with reduced memory and time complexity and a tight error bound. The algorithm operates in a streaming fashion, maintaining a small subset of keys and values in a data structure that is updated efficiently using reservoir sampling and clustering techniques. Theoretical analysis establishes the algorithm's correctness and shows that it runs with sublinear memory and runtime under certain assumptions about the input data. Empirical evaluations on long-context question-answering benchmarks show that SubGen outperforms existing KV cache compression methods in both performance and efficiency, achieving higher accuracy with reduced memory usage.
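The two ideas in the summary can be illustrated with a short sketch. This is not the authors' implementation; the class name, parameters (`cluster_radius`, `reservoir_size`), and the single-representative-value-per-center simplification are all hypothetical. It shows greedy online clustering of keys (a new cluster center is opened only when a key is far from all existing centers, so the number of centers depends on the clusterability of the stream rather than its length) and ℓ₂ sampling of values via weighted reservoir sampling, with attention computed over the retained subset only:

```python
import heapq
import itertools
import math
import random

import numpy as np


class StreamingKVCache:
    """Hypothetical sketch of a sublinear KV cache in the spirit of SubGen.

    Keys are compressed by greedy online clustering; values are kept in a
    weighted reservoir whose inclusion probability is proportional to the
    squared l2 norm of the value (l2 sampling).
    """

    def __init__(self, dim, cluster_radius=1.0, reservoir_size=64, seed=0):
        self.dim = dim
        self.cluster_radius = cluster_radius  # threshold for opening a new cluster
        self.reservoir_size = reservoir_size  # m: number of sampled (k, v) pairs
        self.centers = []                     # cluster centers for the key stream
        self.center_values = []               # one representative value per center
        self.reservoir = []                   # min-heap of (priority, id, key, value)
        self.rng = random.Random(seed)
        self._counter = itertools.count()     # tie-breaker for heap comparisons

    def update(self, key, value):
        # --- online clustering of keys ---
        # open a new cluster only when the key is far from every existing center
        if not self.centers or min(
            np.linalg.norm(key - c) for c in self.centers
        ) > self.cluster_radius:
            self.centers.append(key.copy())
            self.center_values.append(value.copy())

        # --- l2 sampling of values via weighted reservoir sampling ---
        # Efraimidis-Spirakis priorities u**(1/w) with w = ||v||_2^2 keep each
        # item with probability proportional to its weight
        w = float(np.dot(value, value)) + 1e-12
        priority = self.rng.random() ** (1.0 / w)
        entry = (priority, next(self._counter), key.copy(), value.copy())
        if len(self.reservoir) < self.reservoir_size:
            heapq.heappush(self.reservoir, entry)
        elif priority > self.reservoir[0][0]:
            heapq.heapreplace(self.reservoir, entry)

    def attend(self, query):
        # approximate softmax attention over the retained subset only;
        # the exact computation would use every past (key, value) pair
        keys = [k for _, _, k, _ in self.reservoir] + self.centers
        vals = [v for _, _, _, v in self.reservoir] + self.center_values
        K, V = np.stack(keys), np.stack(vals)
        scores = K @ query / math.sqrt(self.dim)
        scores -= scores.max()                # numerical stability
        p = np.exp(scores)
        p /= p.sum()
        return p @ V


# usage: feed a stream of 1,000 key/value pairs, then decode one query;
# memory held is bounded by reservoir_size plus the number of cluster centers
rng = np.random.default_rng(0)
cache = StreamingKVCache(dim=8, cluster_radius=3.0, reservoir_size=16)
for _ in range(1000):
    cache.update(rng.standard_normal(8), rng.standard_normal(8))
out = cache.attend(rng.standard_normal(8))
```

Note the simplification: the sketch stores one representative value per cluster center, whereas the paper's data structure maintains richer per-cluster statistics; the point here is only the mechanism by which memory stays sublinear in the stream length.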