Length Generalization of Causal Transformers without Position Encoding

28 May 2024 | Jie Wang*, Tao Ji*, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
This paper explores the length generalization property of transformers without explicit position encoding (NoPE). The authors find that while NoPE can extend to longer sequences than explicit position encodings, it still has a limited context length. They identify a connection between the failure of NoPE's generalization and the distraction of attention distributions: when NoPE's extrapolation performance collapses, attention heads begin to allocate weights almost uniformly across positions. To address this, they propose a parameter-efficient tuning method that searches for the best temperature hyper-parameters for the attention heads, which significantly expands NoPE's context size. Experiments on long-sequence language modeling, a synthetic passkey retrieval task, and real-world long-context tasks show that NoPE achieves competitive performance with state-of-the-art length generalization algorithms. The source code is publicly available.
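The summary describes the head-wise temperature tuning only at a high level. The sketch below illustrates the general idea of per-head temperature scaling in causal attention without position encoding; the function name, tensor shapes, and the exact way the per-head scalar is applied are illustrative assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def nope_attention_with_head_temperature(q, k, v, temperatures):
    """
    Causal scaled dot-product attention without position encoding (NoPE),
    with one temperature scalar per attention head scaling the logits.

    q, k, v:       (batch, num_heads, seq_len, head_dim)
    temperatures:  (num_heads,) positive scalars, one per head
    """
    head_dim = q.size(-1)
    seq_len = q.size(-2)

    # Standard attention logits; no position information is injected anywhere.
    logits = torch.matmul(q, k.transpose(-2, -1)) / head_dim ** 0.5

    # Sharpen (temperature > 1) or flatten (temperature < 1) each head's
    # attention distribution with its own scalar.
    logits = logits * temperatures.view(1, -1, 1, 1)

    # Causal mask: each position attends only to itself and earlier positions.
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    logits = logits.masked_fill(mask, float("-inf"))

    weights = F.softmax(logits, dim=-1)
    return torch.matmul(weights, v)
```

In a parameter-efficient setup of this kind, only the `temperatures` vector (a handful of scalars per layer) would be tuned on longer sequences while the rest of the model stays frozen, which is consistent with the paper's goal of counteracting attention distraction at extended context lengths.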