Base of RoPE Bounds Context Length

23 May 2024 | Xin Men*, Mingyu Xu*, Bingning Wang*†, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen
This paper examines the relationship between the base of Rotary Position Embedding (RoPE) and the context length of large language models (LLMs). RoPE encodes position information with rotation matrices, and adjusting its base has become a common way to extend an LLM's context length. The authors find, however, that such extensions can be superficial: current practice chooses the base mainly from out-of-distribution (OOD) arguments about rotation angles. They instead identify a long-term decay property of RoPE, namely that a model's ability to attend more strongly to similar tokens than to random tokens decays as relative distance grows, and from this property derive an absolute lower bound on the base value required to reach a given context length. Theoretical and empirical results show that the bound holds in both the fine-tuning and pre-training stages. The findings suggest that the base of RoPE bounds the context length: a base that is too small yields only superficial long-context capability, preserving low perplexity while losing the ability to retrieve information from long contexts. The paper thereby deepens the understanding of RoPE's role in LLMs and offers practical guidance for choosing the base in long-context modeling.
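
To make the role of the base concrete, below is a minimal NumPy sketch of standard RoPE (as defined by Su et al.), not the paper's own code or its decay-based bound. It illustrates the two facts the summary relies on: after RoPE, the query-key score depends only on the relative distance between positions, and the per-pair rotation frequencies theta_i = base^(-2i/d) slow down as the base grows, so the slowest-rotating dimension takes longer to wrap around. The function name, head dimension, bases, and example positions are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Apply standard RoPE to one query/key vector x (even dimension d) at position pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even head dimension"
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # per-pair frequencies theta_i
    angles = pos * theta                            # rotation angle of pair i at this position
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin       # rotate each (even, odd) pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d = 128  # illustrative head dimension
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

for base in (1e4, 5e5):
    # The score depends only on the relative distance m - n, not on m and n themselves.
    near = rope_rotate(q, 10, base) @ rope_rotate(k, 8, base)       # distance 2, early positions
    far = rope_rotate(q, 4000, base) @ rope_rotate(k, 3998, base)   # distance 2, late positions
    # The slowest pair completes a full turn only after 2*pi*base^((d-2)/d) positions,
    # so a larger base keeps distant positions distinguishable for longer.
    slowest_period = 2 * np.pi * base ** ((d - 2) / d)
    print(f"base={base:.0e}  relative-only score: {np.isclose(near, far)}  "
          f"slowest rotation period ~ {slowest_period:.0f} tokens")
```

The slowest-period figure is only the classic rotation-angle intuition for why a longer context calls for a larger base; the paper's contribution is a tighter lower bound derived from the long-term decay property, which this sketch does not reproduce.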