Repeat After Me: Transformers are Better than State Space Models at Copying

2024 | Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Transformers outperform generalized state space models (GSSMs) at copying from the input context: GSSMs must compress the entire sequence into a fixed-size latent state, while transformers can attend to the full input directly. Theoretical analysis shows that transformers can copy strings of length exponential in their size, whereas GSSMs with a fixed-size state cannot. Empirical results on synthetic copying tasks confirm that transformers learn the task more efficiently and generalize better to inputs longer than those seen during training. Transformers also outperform GSSMs at retrieving information from context, as demonstrated by experiments with pre-trained models such as Pythia and Mamba. Together, the theoretical and experimental results reveal a fundamental gap between the two architectures on memory-intensive tasks: GSSMs struggle because of their fixed state size, which suggests that transformers are better suited to practical tasks requiring access to long input sequences.
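To make the theoretical claim concrete, here is a minimal counting-argument sketch, using illustrative notation (state size $b$ in bits, alphabet $\Sigma$, string length $n$) rather than the paper's exact theorem statement. A GSSM must compress the input into a latent state of at most $b$ bits, which can take at most $2^b$ distinct values, while there are $|\Sigma|^n$ distinct strings of length $n$. Copying every such string perfectly therefore requires

\[
  2^{b} \;\ge\; |\Sigma|^{n}
  \quad\Longleftrightarrow\quad
  n \;\le\; \frac{b}{\log_2 |\Sigma|},
\]

so the copyable length grows only linearly with the state size. A transformer, by contrast, keeps the whole input in its context and, per the paper's construction, can copy strings of length exponential in its size by looking up stored n-grams in the context.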