9 Aug 2024 | Tsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal
This paper introduces Infini-attention, a new attention mechanism that lets Transformer-based Large Language Models (LLMs) process infinitely long inputs with bounded memory and computation. Infini-attention builds a compressive memory into the standard attention mechanism and combines masked local attention with long-term linear attention within a single Transformer block. It reuses the key, value, and query states of standard attention for long-term memory consolidation and retrieval, allowing the model to maintain the entire context history in a fixed-size compressive memory and to run fast streaming inference over extremely long inputs.

The approach supports long-context tasks such as language modeling, book summarization, and passkey retrieval. It outperforms baseline models on long-context language modeling benchmarks while achieving a 114x compression ratio in memory size, and after continual pre-training and fine-tuning it sets a new state of the art on a 500K-length book summarization task. Because inputs are processed segment by segment in a streaming fashion, LLMs can scale to unboundedly long contexts with fixed memory and compute. Compared with other long-context models, Infini-attention delivers better performance across language modeling, passkey retrieval, and book summarization while keeping memory and computational overhead minimal.
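To make the mechanism concrete, here is a minimal single-head sketch in PyTorch of the ideas summarized above: per-segment causal local attention, a fixed-size compressive memory read with the current queries and updated with the segment's keys and values, and a learned scalar gate mixing the two streams. This is an illustrative reconstruction under stated assumptions, not the authors' reference implementation; names such as `InfiniAttentionHead`, `segment`, and `beta` are hypothetical, and the memory update shown is the simple linear (non-delta) variant.

```python
# Illustrative sketch only; parameter names and shapes are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfiniAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.out_proj = nn.Linear(d_head, d_model, bias=False)
        # Learned scalar gate mixing long-term (memory) and local attention.
        self.beta = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _sigma(x: torch.Tensor) -> torch.Tensor:
        # Positive kernel used for memory read/write (ELU + 1).
        return F.elu(x) + 1.0

    def forward(self, segment: torch.Tensor, memory=None):
        """Process one segment of shape (seq, d_model); `memory` is (M, z)."""
        q = self.q_proj(segment)
        k = self.k_proj(segment)
        v = self.v_proj(segment)

        if memory is None:
            M = torch.zeros(self.d_head, self.d_head, device=segment.device)
            z = torch.zeros(self.d_head, device=segment.device)
        else:
            M, z = memory

        # 1) Retrieve long-term values from the compressive memory with the queries.
        sq = self._sigma(q)                                # (seq, d_head)
        a_mem = (sq @ M) / (sq @ z + 1e-6).unsqueeze(-1)   # (seq, d_head)

        # 2) Masked local attention: standard causal dot-product within the segment.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        a_local = torch.softmax(scores, dim=-1) @ v        # (seq, d_head)

        # 3) Gate the two attention streams with a learned scalar.
        g = torch.sigmoid(self.beta)
        out = g * a_mem + (1.0 - g) * a_local

        # 4) Update the fixed-size memory with this segment's keys and values,
        #    so state stays bounded no matter how many segments stream through.
        sk = self._sigma(k)
        M = M + sk.transpose(-2, -1) @ v                   # (d_head, d_head)
        z = z + sk.sum(dim=0)                              # (d_head,)

        return self.out_proj(out), (M, z)


# Usage: stream a long input segment by segment; the carried state (M, z)
# has constant size, which is what bounds memory for arbitrarily long contexts.
head = InfiniAttentionHead(d_model=64, d_head=32)
x = torch.randn(8 * 128, 64)          # a "long" input split into 8 segments
state = None
for seg in x.split(128, dim=0):
    y, state = head(seg, state)
```

The key design point the sketch tries to convey is that the same projected key, value, and query states serve both the local attention and the memory read/write, so long-term context comes at the cost of a single d_head-by-d_head matrix and a normalization vector per head rather than a growing KV cache.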