14 Feb 2024 | Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal and Denny Zhou
Transformers can achieve length generalization, but not robustly. This study demonstrates that standard Transformers can extrapolate to sequences 2.5 times longer than those seen during training when given the right combination of data format and position encoding, such as FIRE (Functional Interpolation for Relative position Encoding) together with randomized position encodings. However, length generalization is fragile: it is highly sensitive to factors such as random weight initialization and training-data order, producing large performance variations across random seeds.
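To make the position-encoding ingredient concrete, below is a minimal sketch of a FIRE-style relative bias in the spirit of Li et al. (2023): the attention bias for a query/key pair is a small MLP applied to a log-transformed relative distance, normalized by the (thresholded) query position so the MLP's input stays bounded even at unseen lengths. The module structure, hidden size, and initial values of the learnable constants c and L are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    """Sketch of a FIRE-style relative position bias (hypothetical hyperparameters).

    b(q, k) = f_theta( psi(q - k) / psi(max(L, q)) ),  psi(x) = log(c * x + 1),
    where f_theta is a small MLP and c, L are learnable scalars.
    """

    def __init__(self, num_heads: int, hidden_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.c = nn.Parameter(torch.tensor(1.0))    # slope of the log transform
        self.L = nn.Parameter(torch.tensor(512.0))  # threshold for progressive interpolation

    def psi(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(torch.abs(self.c) * x + 1.0)

    def forward(self, seq_len: int) -> torch.Tensor:
        q = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # query positions
        k = torch.arange(seq_len, dtype=torch.float32).unsqueeze(0)  # key positions
        rel = (q - k).clamp(min=0.0)                # causal relative distances
        denom = self.psi(torch.maximum(q, self.L))  # normalize by max(L, query index)
        scores = self.psi(rel) / denom              # bounded input -> better extrapolation
        bias = self.mlp(scores.unsqueeze(-1))       # (seq, seq, num_heads)
        return bias.permute(2, 0, 1)                # (num_heads, seq, seq): added to attention logits
```

Because the MLP only ever sees a bounded, length-normalized scalar, the bias remains well-defined for positions far beyond the training range, which is what makes FIRE attractive for length generalization.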
The research treats N-digit decimal addition as a synthetic language-learning task and shows that successful length generalization hinges on the choice of position encoding and data format. It introduces a recipe combining FIRE position encodings, randomized position encodings, the reversed format, and index hints. Together, these components let a model trained on additions of up to 40 digits generalize to sequences of up to 100 digits, an extrapolation ratio of 2.5 times the training length.
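One way to read the randomized-position-encoding ingredient (in the spirit of Ruoss et al., 2023) is sketched below: during training, position indices are drawn as a random ordered subset of a range much larger than the training length, so the model is exposed to large position values long before it sees long sequences. The function name and the max_pos setting are illustrative assumptions.

```python
import torch

def randomized_positions(seq_len: int, max_pos: int = 256) -> torch.Tensor:
    """Sketch of randomized position encoding: sample an ordered subset of
    `seq_len` indices from [0, max_pos). Longer test sequences then stay within
    the range of position values already seen during training."""
    assert seq_len <= max_pos
    positions = torch.randperm(max_pos)[:seq_len]  # random subset of candidate positions
    return positions.sort().values                 # keep the indices in increasing order

# Usage sketch: pass these indices to the position-encoding layer (or use them
# as the positions from which a relative bias such as FIRE is computed) in
# place of the default 0..seq_len-1.
pos_ids = randomized_positions(seq_len=40, max_pos=256)
```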
The study also highlights the importance of data formatting: the reversed format writes digits least-significant first, matching the order in which addition is naturally computed, and index hints help the model align corresponding operand and answer digits. Random space augmentation and randomized position encoding, by contrast, have mixed effects whose benefit depends on how they are combined with the other components. With the full recipe, Transformers reach near-perfect accuracy on 100-digit addition, yet performance remains highly variable and sensitive to initialization and training-data order.
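As a concrete illustration of the two formatting ideas, the sketch below serializes an addition example with digits reversed (least-significant first) and each digit prefixed by an index-hint letter. The exact tokens and hint alphabet used in the paper may differ (here the hints are limited to 26 lowercase letters for brevity), so treat this as an assumption-laden sketch rather than the authors' format.

```python
import string

def format_addition(a: int, b: int, reverse: bool = True, hints: bool = True) -> str:
    """Serialize `a + b = sum` in reversed format with index hints (illustrative)."""
    def encode(n: int) -> str:
        digits = str(n)
        if reverse:
            digits = digits[::-1]  # least-significant digit first, matching carry order
        if hints:
            # Prefix each digit with a positional letter so the model can align
            # corresponding digits of the operands and the answer.
            return "".join(h + d for h, d in zip(string.ascii_lowercase, digits))
        return digits

    return f"{encode(a)}+{encode(b)}={encode(a + b)}"

print(format_addition(42, 39))  # 'a2b4+a9b3=a1b8'  (i.e. 42 + 39 = 81, digits reversed)
```

Random space augmentation, also mentioned above, would additionally insert random spaces between input characters; it is omitted from this sketch for brevity.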
Despite these advances, robust length generalization remains challenging. Model size and regularization techniques have limited impact, while the randomness of weight initialization and training-data order strongly influences performance. The research underscores that the right combination of data format and position encoding enables effective length generalization, but that this capability remains fragile in practice.