21 Feb 2024 | Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang
LongRoPE is a method that extends the context window of large language models (LLMs) to an unprecedented 2048k tokens while maintaining performance at shorter context lengths. The method introduces three key innovations: (1) it exploits two forms of non-uniformities in positional interpolation through an efficient search, providing better initialization for fine-tuning and enabling an 8× extension in non-fine-tuning scenarios; (2) it introduces a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (3) it readjusts LongRoPE on 8k-length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of the method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations. Code will be available at https://github.com/microsoft/LongRoPE.
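To make the interpolation idea concrete, here is a minimal sketch of rotary position embedding (RoPE) angles with per-dimension rescale factors. The function name `rope_angles` and the `rescale` parameter are illustrative, not from the paper; the paper's actual search over non-uniformities is not shown — the sketch only demonstrates that uniform positional interpolation (dividing positions by a factor s) is the special case where every per-dimension factor equals s.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None):
    """Rotary-embedding angles m * theta_i for each position m.
    rescale (hypothetical parameter) stretches each frequency dimension
    independently, mimicking non-uniform positional interpolation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one theta per 2-dim pair
    if rescale is not None:
        inv_freq = inv_freq / np.asarray(rescale)      # per-dimension stretch
    return np.outer(positions, inv_freq)               # shape (len(positions), dim/2)

# Uniform positional interpolation with factor s maps position m to m/s;
# a non-uniform scheme instead assigns a distinct factor to each dimension.
dim, s = 8, 8.0
pos = np.arange(16)
uniform = rope_angles(pos / s, dim)                        # plain interpolation
nonuniform = rope_angles(pos, dim, rescale=[s] * (dim // 2))
assert np.allclose(uniform, nonuniform)  # identical when all factors equal s
```

Replacing the constant list `[s] * (dim // 2)` with searched, dimension-specific factors is the kind of non-uniformity the abstract refers to; an 8× extension corresponds to s = 8 here.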