28 May 2024 | Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
This paper investigates the length generalization capability of Causal Transformers without position encoding (NoPE). The study finds that while NoPE can extend to longer sequences than explicit position encodings, its effective context length is still limited, and this failure to generalize is linked to the distraction of attention distributions. The authors propose a parameter-efficient tuning method that searches for the best temperature hyper-parameters of the attention heads, which significantly enlarges NoPE's context size. Experiments on long-sequence language modeling, synthetic tasks, and real-world long-context tasks show that NoPE achieves competitive performance with state-of-the-art length generalization algorithms. The results indicate that NoPE can generalize to longer sequences simply by adjusting the temperature of the attention mechanism, and a head-based attention scaling variant further enhances this generalization ability. The findings suggest that NoPE has significant potential for length generalization, although handling extremely long contexts remains challenging due to computational and memory constraints. The paper contributes to the understanding of NoPE's generalization capabilities and provides a new approach for improving length generalization in causal transformers.
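The temperature tuning described above amounts to rescaling each head's attention logits before the softmax, with no position encoding added anywhere. Below is a minimal sketch, assuming a PyTorch implementation; the NoPECausalAttention class, the log_temperature parameter, and all shapes are illustrative choices, not the authors' code.

```python
# Minimal sketch of NoPE-style causal attention with a learnable per-head
# temperature (hypothetical implementation, not the paper's code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoPECausalAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # One temperature per head; in a parameter-efficient setup only these
        # scalars would be tuned for length generalization.
        self.log_temperature = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); no position encoding is added anywhere.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, d_head).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Standard scaled dot-product logits, then divide by a per-head temperature.
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        temperature = self.log_temperature.exp().view(1, self.n_heads, 1, 1)
        logits = logits / temperature
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        logits = logits.masked_fill(mask, float("-inf"))
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)
```

Lowering a head's temperature sharpens its attention distribution, which is one way to counteract the attention "distraction" the paper identifies when NoPE is evaluated beyond its training length.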