2024 | Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, Di He
This paper introduces BiPE, a bilevel positional encoding method designed to improve length extrapolation in language models. BiPE combines intra-segment and inter-segment encodings to better capture semantic information and segment relationships: the intra-segment encoding locates each token within its segment using absolute positional encoding, while the inter-segment encoding identifies the segment index using relative positional encoding. This bilevel design aligns with the intrinsic segmentation of text data and enhances length extrapolation. A theoretical analysis supports BiPE's parameter efficiency, and empirical results show that it outperforms existing methods on mathematical reasoning, language modeling, and long-context benchmarks. BiPE is compatible with different positional encoding schemes and can be combined with fine-tuning strategies for further gains. The method is effective across different text modalities and does not hurt performance on normal-length text. Future work includes exploring hierarchical versions of BiPE and improving segmentation methods for general sequence data.
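To make the bilevel idea concrete, below is a minimal sketch of how intra-segment positions and segment indices could be derived from a flat token sequence and turned into the two encodings. The separator token id, the sinusoidal intra-segment encoding, the ALiBi-style inter-segment bias, and the slope value are illustrative assumptions, not the paper's exact implementation (the paper also reports RoPE-based variants).

```python
import numpy as np

# Hypothetical segment-delimiter token ids (e.g., the id of "."); the paper's
# actual segmentation rule may differ -- this is an illustrative assumption.
SEP_IDS = {42}

def bilevel_positions(token_ids):
    """Map a flat token sequence to (intra_pos, seg_idx) pairs.

    intra_pos: position of the token within its segment (restarts after each
               delimiter) -> fed to an absolute positional encoding.
    seg_idx:   index of the segment the token belongs to -> fed to a relative
               encoding across segments.
    """
    intra_pos, seg_idx = [], []
    pos, seg = 0, 0
    for tid in token_ids:
        intra_pos.append(pos)
        seg_idx.append(seg)
        if tid in SEP_IDS:      # delimiter closes the current segment
            pos, seg = 0, seg + 1
        else:
            pos += 1
    return np.array(intra_pos), np.array(seg_idx)

def sinusoidal_pe(positions, d_model):
    """Standard absolute sinusoidal encoding, applied to intra-segment positions."""
    i = np.arange(d_model // 2)
    angles = positions[:, None] / (10000 ** (2 * i / d_model))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def segment_alibi_bias(seg_idx, slope=0.5):
    """ALiBi-style linear bias computed on segment indices rather than token
    positions (one assumed way to realize the inter-segment relative encoding)."""
    rel = seg_idx[None, :] - seg_idx[:, None]   # signed segment distance
    return -slope * np.abs(rel)                 # added to the attention logits

# Toy usage: 42 acts as the segment separator.
token_ids = [7, 9, 42, 3, 5, 8, 42, 11]
intra, seg = bilevel_positions(token_ids)
pe = sinusoidal_pe(intra, d_model=8)            # added to token embeddings
bias = segment_alibi_bias(seg)                  # added to attention scores
print(intra)  # [0 1 2 0 1 2 3 0]
print(seg)    # [0 0 0 1 1 1 1 2]
```

The key design point the sketch illustrates is that absolute positions never grow beyond the length of a single segment, while distances between segments are handled relatively, which is what lets the scheme generalize to sequences longer than those seen in training.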