14 Feb 2024 | Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal and Denny Zhou
Transformers can achieve length generalization, but not robustly. This study demonstrates that standard Transformers can extrapolate to sequences 2.5 times longer than those seen during training when given the right combination of data format and position encoding, such as FIRE (Functional Interpolation for Relative position Encoding) together with randomized position encodings. However, length generalization is fragile: it is highly sensitive to factors such as random weight initialization and training-data order, producing large performance variations across random seeds.
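To make the position-encoding ingredient concrete, below is a minimal sketch of a FIRE-style relative bias in the spirit of Li et al. (2023): the attention bias for a query/key pair is a small MLP applied to a log-transformed relative distance, normalized by the (thresholded) query position so the MLP's input stays bounded even at unseen lengths. The module structure, hidden size, and initial values of the learnable constants c and L are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    """Sketch of a FIRE-style relative position bias (hypothetical hyperparameters).

    b(q, k) = f_theta( psi(q - k) / psi(max(L, q)) ),  psi(x) = log(c * x + 1),
    where f_theta is a small MLP and c, L are learnable scalars.
    """

    def __init__(self, num_heads: int, hidden_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.c = nn.Parameter(torch.tensor(1.0))    # slope of the log transform
        self.L = nn.Parameter(torch.tensor(512.0))  # threshold for progressive interpolation

    def psi(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(torch.abs(self.c) * x + 1.0)

    def forward(self, seq_len: int) -> torch.Tensor:
        q = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # query positions
        k = torch.arange(seq_len, dtype=torch.float32).unsqueeze(0)  # key positions
        rel = (q - k).clamp(min=0.0)                # causal relative distances
        denom = self.psi(torch.maximum(q, self.L))  # normalize by max(L, query index)
        scores = self.psi(rel) / denom              # bounded input -> better extrapolation
        bias = self.mlp(scores.unsqueeze(-1))       # (seq, seq, num_heads)
        return bias.permute(2, 0, 1)                # (num_heads, seq, seq): added to attention logits
```

Because the MLP only ever sees a bounded, length-normalized scalar, the bias remains well-defined for positions far beyond the training range, which is what makes FIRE attractive for length generalization.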
The research treats N-digit decimal addition as a synthetic language-learning task and shows that successful length generalization hinges on the choice of position encoding and data format. It introduces a recipe combining FIRE position encodings, randomized position encodings, the reversed format, and index hints. Together, these components let a model trained on additions of up to 40 digits generalize to sequences of up to 100 digits, an extrapolation ratio of 2.5 times the training length.
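One way to read the randomized-position-encoding ingredient (in the spirit of Ruoss et al., 2023) is sketched below: during training, position indices are drawn as a random ordered subset of a range much larger than the training length, so the model is exposed to large position values long before it sees long sequences. The function name and the max_pos setting are illustrative assumptions.

```python
import torch

def randomized_positions(seq_len: int, max_pos: int = 256) -> torch.Tensor:
    """Sketch of randomized position encoding: sample an ordered subset of
    `seq_len` indices from [0, max_pos). Longer test sequences then stay within
    the range of position values already seen during training."""
    assert seq_len <= max_pos
    positions = torch.randperm(max_pos)[:seq_len]  # random subset of candidate positions
    return positions.sort().values                 # keep the indices in increasing order

# Usage sketch: pass these indices to the position-encoding layer (or use them
# as the positions from which a relative bias such as FIRE is computed) in
# place of the default 0..seq_len-1.
pos_ids = randomized_positions(seq_len=40, max_pos=256)
```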
The study also highlights the importance of data formatting: the reversed format writes digits least-significant first, matching the order in which addition is naturally computed, and index hints help the model align corresponding operand and answer digits. Random space augmentation and randomized position encoding, by contrast, have mixed effects whose benefit depends on how they are combined with the other components. With the full recipe, Transformers reach near-perfect accuracy on 100-digit addition, yet performance remains highly variable and sensitive to initialization and training-data order.
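As a concrete illustration of the two formatting ideas, the sketch below serializes an addition example with digits reversed (least-significant first) and each digit prefixed by an index-hint letter. The exact tokens and hint alphabet used in the paper may differ (here the hints are limited to 26 lowercase letters for brevity), so treat this as an assumption-laden sketch rather than the authors' format.

```python
import string

def format_addition(a: int, b: int, reverse: bool = True, hints: bool = True) -> str:
    """Serialize `a + b = sum` in reversed format with index hints (illustrative)."""
    def encode(n: int) -> str:
        digits = str(n)
        if reverse:
            digits = digits[::-1]  # least-significant digit first, matching carry order
        if hints:
            # Prefix each digit with a positional letter so the model can align
            # corresponding digits of the operands and the answer.
            return "".join(h + d for h, d in zip(string.ascii_lowercase, digits))
        return digits

    return f"{encode(a)}+{encode(b)}={encode(a + b)}"

print(format_addition(42, 39))  # 'a2b4+a9b3=a1b8'  (i.e. 42 + 39 = 81, digits reversed)
```

Random space augmentation, also mentioned above, would additionally insert random spaces between input characters; it is omitted from this sketch for brevity.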
Despite these advances, robust length generalization remains challenging. Model size and regularization techniques have limited impact, while the randomness of weight initialization and training-data order strongly influences performance. The research underscores that the right combination of data format and position encoding enables effective length generalization, but that this capability remains fragile in practice.