25 Mar 2024 | Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
This paper introduces LONGHEADs, a training-free framework that enhances the long-context processing ability of large language models (LLMs) by leveraging the untapped potential of multi-head attention. Existing methods for handling long contexts often struggle with out-of-distribution (OOD) position issues and quadratic computational complexity. LONGHEADs addresses these challenges by letting each attention head process only the context chunks relevant to the current token, rather than attending to the entire sequence. This reduces computational cost, avoids OOD positions, and enables processing of long sequences in linear time.
The key idea of LONGHEADs is to exploit the inherent correlation between query and key representations to select and attend to the important context chunks. The input is broken into chunks and summarized with chunk-level features, so each head can focus on the relevant parts of the context while operating entirely within the pre-trained length. Different heads across layers collectively cover the longer context while keeping computational complexity linear.
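The chunk-level scoring can be pictured with a short sketch. This is a minimal illustration, not the paper's implementation: the mean-pooled chunk representation and plain dot-product relevance below are assumptions standing in for whatever aggregation and scoring the paper actually uses.

```python
import numpy as np

def split_into_chunks(keys: np.ndarray, chunk_size: int) -> list[np.ndarray]:
    """Split one head's key cache of shape (seq_len, head_dim) into fixed-size chunks."""
    return [keys[i:i + chunk_size] for i in range(0, keys.shape[0], chunk_size)]

def chunk_representations(chunks: list[np.ndarray]) -> np.ndarray:
    """Collapse each chunk into a single chunk-level feature.
    Mean pooling is an illustrative stand-in for the paper's aggregation function."""
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

def chunk_scores(query: np.ndarray, chunk_reps: np.ndarray) -> np.ndarray:
    """Score each chunk's relevance to the current query token with a dot product,
    reusing the query/key correlation the attention head already computes."""
    return chunk_reps @ query
```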
LONGHEADs is designed to work seamlessly with LLMs that use relative positional encoding. It achieves 100% accuracy on the passkey retrieval task at 128k length, demonstrating its effectiveness in extending the usable context window for existing models. Experiments show that LONGHEADs outperforms other restricted attention methods on long context benchmarks and achieves comparable performance to full attention methods with linear computational cost.
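One way to read "avoids OOD issues" for models with relative positional encoding is that the selected chunks are re-packed into fresh, contiguous positions inside the pre-trained window before the rotary encoding is applied. The sketch below illustrates that idea; the exact position assignment is an assumption, not a claim about the paper's implementation.

```python
import numpy as np

def gather_selected_kv(keys: np.ndarray, values: np.ndarray,
                       selected_chunks: list[int], chunk_size: int):
    """Concatenate the K/V entries of the selected chunks in their original order
    and assign contiguous positions 0..m-1, so the positions fed to the rotary
    encoding never exceed the pre-trained window (an illustrative assumption)."""
    token_ids = np.concatenate([
        np.arange(c * chunk_size, min((c + 1) * chunk_size, keys.shape[0]))
        for c in sorted(selected_chunks)
    ])
    new_positions = np.arange(token_ids.shape[0])  # stays in-distribution for RoPE
    return keys[token_ids], values[token_ids], new_positions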
The framework is evaluated on language modeling, synthetic retrieval, and long context benchmarks. Results show that LONGHEADs achieves state-of-the-art performance among restricted attention methods and performs competitively with full attention methods. It extends the usable context window of LLaMA-2-7B from 4k to eight times that length, demonstrating its ability to generalize to longer sequences.
The chunk selection strategy in LONGHEADs picks chunks based on their relevance to the current token, and it remains effective across different sequence lengths and tasks. The framework is also flexible with respect to chunk size and the number of selected chunks; four selected chunks already provide enough information to maintain performance.
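A per-token selection policy under these constraints might look like the sketch below, reusing the dot-product relevance from the earlier sketch. Always keeping the first chunk and the most recent (local) chunk is an assumption mirroring common practice for preserving the sequence start and local context; it is not taken from the paper.

```python
import numpy as np

def select_chunks(query: np.ndarray, chunk_reps: np.ndarray, k: int = 4) -> list[int]:
    """Pick k chunks for one head at the current decoding step.
    The first and the most recent chunks are always kept (an assumed policy);
    the remaining slots go to the highest-scoring chunks by dot-product relevance."""
    n_chunks = chunk_reps.shape[0]
    always_keep = {0, n_chunks - 1}
    scores = chunk_reps @ query
    ranked = [i for i in np.argsort(-scores) if i not in always_keep]
    budget = max(0, k - len(always_keep))
    return sorted(int(i) for i in always_keep | set(ranked[:budget]))
```

With the default k = 4 used here, each head attends to at most 4 × chunk_size keys regardless of the total sequence length, which is where the linear overall cost described above comes from.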
LONGHEADs is a training-free inference framework that leverages the structural properties of attention heads to process long sequences efficiently. It is a state-of-the-art restricted-attention long context processor that runs in linear time and achieves performance comparable to full-attention methods. The results demonstrate that LONGHEADs enables LLMs to directly generalize to longer sequences and to match or even surpass methods that require continual fine-tuning.