This paper proposes SelfExtend, a method to extend the context window of large language models (LLMs) without fine-tuning. LLMs typically have a limited context window because they are pretrained on sequences of a fixed maximum length. When an input at inference time exceeds that window, the model encounters relative positions it never saw during pretraining and often fails due to this out-of-distribution (O.O.D.) positional encoding. SelfExtend addresses this with a simple floor division operation that maps the unseen large relative positions onto positions encountered during pretraining, enabling LLMs to handle longer contexts without any additional training.
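To make the mapping concrete, below is a minimal PyTorch sketch of the floor-division idea; the function name and the `group_size` parameter are illustrative, not the paper's API.

```python
import torch

def grouped_relative_positions(seq_len: int, group_size: int) -> torch.Tensor:
    """Sketch: collapse large relative positions into the pretrained range
    by floor-dividing absolute positions before taking differences."""
    pos = torch.arange(seq_len)
    # Floor division shrinks every position by `group_size`, so relative
    # distances that would exceed the pretraining window are mapped back
    # into it (at the cost of coarser position resolution).
    grouped = pos // group_size
    # rel[i, j] = grouped position of query i minus grouped position of key j
    return grouped.unsqueeze(1) - grouped.unsqueeze(0)
```

For example, with `seq_len=8192` and `group_size=4`, the largest mapped relative distance is 2047, which fits inside a 4k pretrained window.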
SelfExtend combines two attention mechanisms: grouped attention and standard attention. Grouped attention handles long-distance relationships between tokens by mapping their relative positions into the range seen during pretraining, while standard attention is applied to adjacent tokens within a fixed neighbor window, preserving exact positional information where it matters most. This dual mechanism allows LLMs to maintain coherence over longer texts without additional fine-tuning.
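The following sketch shows one way the two position schemes can be merged into a single relative-position matrix: exact distances inside the neighbor window, floor-divided distances (shifted so they continue at the window boundary) outside it. This is a simplified, non-causal illustration under assumed names (`group_size`, `neighbor_window`), not the paper's exact implementation.

```python
import torch

def selfextend_relative_positions(seq_len: int, group_size: int,
                                  neighbor_window: int) -> torch.Tensor:
    """Sketch: merge standard (neighbor) and grouped relative positions."""
    pos = torch.arange(seq_len)
    # Standard relative positions, used for nearby tokens.
    rel_standard = pos.unsqueeze(1) - pos.unsqueeze(0)
    # Grouped relative positions for distant tokens, shifted so they
    # continue where the neighbor window ends.
    grouped = pos // group_size
    rel_grouped = grouped.unsqueeze(1) - grouped.unsqueeze(0)
    rel_grouped = rel_grouped + (neighbor_window - neighbor_window // group_size)
    # Per query-key pair: exact distance inside the window, grouped outside.
    inside_window = rel_standard.abs() <= neighbor_window
    return torch.where(inside_window, rel_standard, rel_grouped)
```

In practice the grouped positions only need to kick in for token pairs farther apart than the neighbor window, which is why the shift is applied so the two schemes line up at that boundary.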
Experiments on multiple benchmarks show that SelfExtend significantly improves the long-context understanding ability of LLMs, often outperforming fine-tuning-based methods. It maintains performance on short-context tasks and enhances performance on long-context tasks. SelfExtend is a plug-and-play method that can be easily integrated into existing LLMs, making it a practical solution for extending context windows without the need for additional training. The results demonstrate that LLMs have inherent capabilities to handle long contexts, and SelfExtend effectively leverages these capabilities to extend context windows efficiently.