13 Jan 2024 | Yikai Zhang, Junlong Li, Pengfei Liu
This paper addresses the limited extrapolation ability of Large Language Models (LLMs) beyond their pre-trained context window, which constrains their application in tasks requiring lengthy inputs. The authors propose a method to extend LLMs' context window by modifying Rotary Position Embedding (RoPE), a widely used position encoding scheme. They identify the importance of maintaining stable attention entropy and introduce "entropy-aware ABF" (Adjusted Base Frequency), which combines adjusting RoPE's base frequency with dynamically scaling the attention logits. The method is validated through experiments on various context-demanding tasks, demonstrating superior fine-tuning performance and robustness across different context window sizes. Notably, it extends the context window of LLaMA-2-7B-Chat to 16,384 tokens with only 100 samples and 6 training steps, showcasing high efficiency. The paper also explores the impact of data compositions and training curricula on context window extension, suggesting that fine-tuning on lengthy conversations is an effective starting point. The authors release their code and SFT data for further research.
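To make the two ingredients concrete, below is a minimal sketch of how adjusting RoPE's base frequency and scaling attention logits might look in practice. The specific base value (500,000), the log-ratio scaling factor, and the helper names `rope_frequencies` / `scaled_attention_logits` are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch


def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies.

    ABF (Adjusted Base Frequency) simply raises `base`, e.g. from 10,000 to
    an assumed 500,000, which slows the rotation of high-frequency dimensions
    and helps the model handle positions beyond its pre-trained window.
    """
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))


def scaled_attention_logits(q: torch.Tensor, k: torch.Tensor,
                            train_len: int, seq_len: int) -> torch.Tensor:
    """Scaled dot-product logits with an extra length-dependent factor.

    The log-ratio factor (a hypothetical choice here) sharpens attention on
    inputs longer than the training context, keeping attention entropy from
    growing with sequence length; it is a no-op within the training window,
    which is the "dynamic" part of the scheme.
    """
    d = q.size(-1)
    scale = max(1.0, math.log(seq_len) / math.log(train_len))
    return (q @ k.transpose(-2, -1)) * scale / math.sqrt(d)


# Example: inverse frequencies with an enlarged base, and logits for a
# 16,384-token input on a model trained with a 4,096-token window.
inv_freq = rope_frequencies(head_dim=128, base=500_000.0)
q = torch.randn(1, 8, 16384, 128)
k = torch.randn(1, 8, 16384, 128)
logits = scaled_attention_logits(q, k, train_len=4096, seq_len=16384)
```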