9 Aug 2024 | Tsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal
This paper introduces Infini-attention, a new attention mechanism that lets Transformer-based Large Language Models (LLMs) process infinitely long inputs with bounded memory and computation. Infini-attention builds a compressive memory into the standard attention mechanism and combines masked local attention with long-term linear attention within a single Transformer block. It reuses the key, value, and query states of standard attention for long-term memory consolidation and retrieval, allowing the model to maintain the entire context history in a fixed-size compressive memory and to run fast streaming inference over extremely long inputs.

The approach supports long-context tasks such as language modeling, book summarization, and passkey retrieval. It outperforms baseline models on long-context language modeling benchmarks while achieving a 114x compression ratio in memory size, and after continual pre-training and fine-tuning it sets a new state of the art on a 500K-length book summarization task. Because inputs are processed segment by segment in a streaming fashion, LLMs can scale to unboundedly long contexts with fixed memory and compute. Compared with other long-context models, Infini-attention delivers better performance across language modeling, passkey retrieval, and book summarization while keeping memory and computational overhead minimal.
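To make the mechanism concrete, here is a minimal single-head sketch in PyTorch of the ideas summarized above: per-segment causal local attention, a fixed-size compressive memory read with the current queries and updated with the segment's keys and values, and a learned scalar gate mixing the two streams. This is an illustrative reconstruction under stated assumptions, not the authors' reference implementation; names such as `InfiniAttentionHead`, `segment`, and `beta` are hypothetical, and the memory update shown is the simple linear (non-delta) variant.

```python
# Illustrative sketch only; parameter names and shapes are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfiniAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.out_proj = nn.Linear(d_head, d_model, bias=False)
        # Learned scalar gate mixing long-term (memory) and local attention.
        self.beta = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _sigma(x: torch.Tensor) -> torch.Tensor:
        # Positive kernel used for memory read/write (ELU + 1).
        return F.elu(x) + 1.0

    def forward(self, segment: torch.Tensor, memory=None):
        """Process one segment of shape (seq, d_model); `memory` is (M, z)."""
        q = self.q_proj(segment)
        k = self.k_proj(segment)
        v = self.v_proj(segment)

        if memory is None:
            M = torch.zeros(self.d_head, self.d_head, device=segment.device)
            z = torch.zeros(self.d_head, device=segment.device)
        else:
            M, z = memory

        # 1) Retrieve long-term values from the compressive memory with the queries.
        sq = self._sigma(q)                                # (seq, d_head)
        a_mem = (sq @ M) / (sq @ z + 1e-6).unsqueeze(-1)   # (seq, d_head)

        # 2) Masked local attention: standard causal dot-product within the segment.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        a_local = torch.softmax(scores, dim=-1) @ v        # (seq, d_head)

        # 3) Gate the two attention streams with a learned scalar.
        g = torch.sigmoid(self.beta)
        out = g * a_mem + (1.0 - g) * a_local

        # 4) Update the fixed-size memory with this segment's keys and values,
        #    so state stays bounded no matter how many segments stream through.
        sk = self._sigma(k)
        M = M + sk.transpose(-2, -1) @ v                   # (d_head, d_head)
        z = z + sk.sum(dim=0)                              # (d_head,)

        return self.out_proj(out), (M, z)


# Usage: stream a long input segment by segment; the carried state (M, z)
# has constant size, which is what bounds memory for arbitrarily long contexts.
head = InfiniAttentionHead(d_model=64, d_head=32)
x = torch.randn(8 * 128, 64)          # a "long" input split into 8 segments
state = None
for seg in x.split(128, dim=0):
    y, state = head(seg, state)
```

The key design point the sketch tries to convey is that the same projected key, value, and query states serve both the local attention and the memory read/write, so long-term context comes at the cost of a single d_head-by-d_head matrix and a normalization vector per head rather than a growing KV cache.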