RoFormer: Enhanced Transformer with Rotary Position Embedding


November 9, 2023 | Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
This paper introduces Rotary Position Embedding (RoPE), a method for injecting positional information into transformer-based pre-trained language models (PLMs). RoPE encodes absolute position with a rotation matrix while incorporating explicit relative position dependency directly into the self-attention computation. The formulation brings several valuable properties: flexibility with respect to sequence length, inter-token dependency that decays as the relative distance grows, and the ability to equip linear self-attention with relative position encoding. The resulting model, RoFormer, is evaluated on long-text classification benchmarks and consistently outperforms alternative position-encoding schemes, and a theoretical analysis supports the experimental results. RoFormer has been integrated into HuggingFace. Keywords: Pre-trained Language Models, Position Information Encoding, Pre-training, Natural Language Processing.

The paper first reviews existing approaches to integrating positional information into transformer models, then formulates the relative position encoding problem and derives RoPE from it. The derivation starts from the 2D case and extends to a computationally efficient realization in which pairs of embedding dimensions are rotated by position-dependent angles (a sketch follows this summary). Two properties are established: long-term decay, meaning the query-key inner product shrinks as the relative position between tokens increases, and compatibility with linear attention, so RoPE can be incorporated into both standard and linear self-attention mechanisms.

Experiments cover machine translation, language-model pre-training, fine-tuning on GLUE tasks, and linear attention with Performer. RoFormer performs better than the baseline transformer on these tasks and significantly outperforms BERT on three of the six GLUE datasets. When combined with Performer, RoPE leads to faster convergence and lower training loss. Pre-training on Chinese data and evaluation on the Chinese AI and Law 2019 Similar Case Matching (CAIL2019-SCM) dataset show that RoFormer handles long texts effectively and outperforms the compared models on them.

The paper concludes that RoFormer enhances the performance of transformer architectures by incorporating explicit relative position dependency into self-attention.
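The core operation admits a compact sketch. Below is a minimal NumPy illustration of the rotary embedding summarized above, assuming the paper's theta_i = 10000^(-2i/d) frequency schedule; the function name and toy shapes are illustrative rather than a reference implementation.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate pairs of channels of x (shape: seq_len x d, d even) by
    position-dependent angles m * theta_i, with theta_i = base**(-2i/d)."""
    seq_len, d = x.shape
    assert d % 2 == 0, "embedding dimension must be even"

    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = np.arange(seq_len)[:, None] * theta   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]                # interleaved channel pairs
    out = np.empty_like(x)
    # Efficient realization: elementwise 2D rotations instead of multiplying
    # by the full block-diagonal rotation matrix.
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Because queries and keys are both rotated, the attention logits depend on
# positions only through the relative offset: <R_m q, R_n k> = q^T R_{n-m} k.
q, k = np.random.randn(8, 64), np.random.randn(8, 64)
logits = rotary_embed(q) @ rotary_embed(k).T
```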
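The long-term decay property can also be checked numerically. In the spirit of the paper's analysis, which examines the attention score with all query and key entries set to one, the hypothetical helper below evaluates that score at increasing relative distances; its magnitude tends to shrink as the distance grows.

```python
import numpy as np

def rope_score_all_ones(rel_dist, d=64, base=10000.0):
    # With q = k = (1, ..., 1), the RoPE attention score between two tokens
    # separated by rel_dist reduces (up to a factor of 2) to
    # sum_i cos(rel_dist * theta_i), where theta_i = base**(-2i/d).
    theta = base ** (-np.arange(0, d, 2) / d)
    return float(np.cos(rel_dist * theta).sum())

for dist in (1, 10, 50, 100, 250, 500):
    print(f"relative distance {dist:4d}: score {rope_score_all_ones(dist):8.2f}")
```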