25 Mar 2024 | Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
This paper introduces LONGHEADs, a training-free framework that enhances the long-context processing ability of large language models (LLMs) by leveraging the untapped potential of multi-head attention. Existing methods for handling long contexts often struggle with out-of-distribution (OOD) position issues and quadratic computational complexity. LONGHEADs addresses these challenges by letting each attention head process only the context chunks relevant to the current token, rather than attending to the entire sequence. This reduces computational cost, avoids OOD positions, and enables processing of long sequences in linear time.
The key idea of LONGHEADs is to exploit the inherent correlation between query and key representations to select and attend to the important context chunks. The input is broken into chunks and summarized with chunk-level features, so each head can focus on the relevant parts of the context while operating entirely within the pre-trained length. Different heads across layers collectively cover the longer context while keeping computational complexity linear.
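The chunk-level scoring can be pictured with a short sketch. This is a minimal illustration, not the paper's implementation: the mean-pooled chunk representation and plain dot-product relevance below are assumptions standing in for whatever aggregation and scoring the paper actually uses.

```python
import numpy as np

def split_into_chunks(keys: np.ndarray, chunk_size: int) -> list[np.ndarray]:
    """Split one head's key cache of shape (seq_len, head_dim) into fixed-size chunks."""
    return [keys[i:i + chunk_size] for i in range(0, keys.shape[0], chunk_size)]

def chunk_representations(chunks: list[np.ndarray]) -> np.ndarray:
    """Collapse each chunk into a single chunk-level feature.
    Mean pooling is an illustrative stand-in for the paper's aggregation function."""
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

def chunk_scores(query: np.ndarray, chunk_reps: np.ndarray) -> np.ndarray:
    """Score each chunk's relevance to the current query token with a dot product,
    reusing the query/key correlation the attention head already computes."""
    return chunk_reps @ query
```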
LONGHEADs is designed to work seamlessly with LLMs that use relative positional encoding. It achieves 100% accuracy on the passkey retrieval task at 128k length, demonstrating its effectiveness in extending the usable context window for existing models. Experiments show that LONGHEADs outperforms other restricted attention methods on long context benchmarks and achieves comparable performance to full attention methods with linear computational cost.
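One way to read "avoids OOD issues" for models with relative positional encoding is that the selected chunks are re-packed into fresh, contiguous positions inside the pre-trained window before the rotary encoding is applied. The sketch below illustrates that idea; the exact position assignment is an assumption, not a claim about the paper's implementation.

```python
import numpy as np

def gather_selected_kv(keys: np.ndarray, values: np.ndarray,
                       selected_chunks: list[int], chunk_size: int):
    """Concatenate the K/V entries of the selected chunks in their original order
    and assign contiguous positions 0..m-1, so the positions fed to the rotary
    encoding never exceed the pre-trained window (an illustrative assumption)."""
    token_ids = np.concatenate([
        np.arange(c * chunk_size, min((c + 1) * chunk_size, keys.shape[0]))
        for c in sorted(selected_chunks)
    ])
    new_positions = np.arange(token_ids.shape[0])  # stays in-distribution for RoPE
    return keys[token_ids], values[token_ids], new_positions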
The framework is evaluated on language modeling, synthetic retrieval, and long context benchmarks. Results show that LONGHEADs achieves state-of-the-art performance among restricted attention methods and performs competitively with full attention methods. It extends the usable context window of LLaMA-2-7B from 4k to eight times that length, demonstrating its ability to generalize to longer sequences.
The chunk selection strategy in LONGHEADs picks chunks based on their relevance to the current token, and it remains effective across different sequence lengths and tasks. The framework is also flexible with respect to chunk size and the number of selected chunks; four selected chunks already provide enough information to maintain performance.
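A per-token selection policy under these constraints might look like the sketch below, reusing the dot-product relevance from the earlier sketch. Always keeping the first chunk and the most recent (local) chunk is an assumption mirroring common practice for preserving the sequence start and local context; it is not taken from the paper.

```python
import numpy as np

def select_chunks(query: np.ndarray, chunk_reps: np.ndarray, k: int = 4) -> list[int]:
    """Pick k chunks for one head at the current decoding step.
    The first and the most recent chunks are always kept (an assumed policy);
    the remaining slots go to the highest-scoring chunks by dot-product relevance."""
    n_chunks = chunk_reps.shape[0]
    always_keep = {0, n_chunks - 1}
    scores = chunk_reps @ query
    ranked = [i for i in np.argsort(-scores) if i not in always_keep]
    budget = max(0, k - len(always_keep))
    return sorted(int(i) for i in always_keep | set(ranked[:budget]))
```

With the default k = 4 used here, each head attends to at most 4 × chunk_size keys regardless of the total sequence length, which is where the linear overall cost described above comes from.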
LONGHEADs is a training-free inference framework that leverages the structural properties of attention heads to process long sequences efficiently. It is a state-of-the-art restricted-attention long context processor that runs in linear time and achieves performance comparable to full-attention methods. The results demonstrate that LONGHEADs enables LLMs to directly generalize to longer sequences and to match or even surpass methods that require continual fine-tuning.