This paper introduces Sparse Transformers, a modification of the Transformer architecture for efficient modeling of long sequences. Standard Transformers require time and memory that grow quadratically with sequence length; Sparse Transformers instead use sparse factorizations of the attention matrix to reduce this cost to O(n√n). The paper also introduces architectural changes, including restructured residual blocks, sparse attention kernels, and recomputation of attention matrices during the backward pass to save memory. Together, these modifications allow models with hundreds of layers to handle sequences of tens of thousands of timesteps.
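To make the O(n√n) cost concrete, below is a minimal NumPy sketch of the strided factorization described above: one head attends to a local window of roughly √n previous positions and a second head attends to every √n-th earlier position, so each query touches O(√n) keys. The helper name `strided_sparse_masks` and the particular stride are illustrative choices, not taken from the released implementation.

```python
import numpy as np

def strided_sparse_masks(n: int, stride: int):
    """Boolean attention masks for the two heads of a strided factorization.

    Head 1 attends to a local window of the previous `stride` positions;
    head 2 attends to every `stride`-th earlier position. With stride ~ sqrt(n),
    each query row keeps O(sqrt(n)) nonzero entries instead of O(n).
    """
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i             # autoregressive constraint

    local = causal & (i - j < stride)           # head 1: previous `stride` steps
    strided = causal & ((i - j) % stride == 0)  # head 2: one column every `stride` steps
    return local, strided

if __name__ == "__main__":
    n, stride = 16, 4  # stride chosen near sqrt(n) for illustration
    local, strided = strided_sparse_masks(n, stride)
    # Each query attends to roughly 2 * stride keys in total,
    # yet any position remains reachable within two attention steps.
    print(local.sum(axis=1) + strided.sum(axis=1))
```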
The Sparse Transformer architecture is applied to modeling images, audio, and text from raw bytes, achieving state-of-the-art results on density modeling benchmarks such as Enwik8, CIFAR-10, and ImageNet 64x64. The model generates unconditional samples that show global coherence and diversity, and can in principle model sequences of length one million or more.
The paper evaluates different attention patterns on several tasks and finds that the sparse patterns converge to lower error than full attention, suggesting a useful inductive bias. The model is also shown to incorporate long-term dependencies effectively, as evidenced by improved performance with longer contexts and by coherent samples generated from long sequences.
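For comparison with the strided masks sketched earlier, the snippet below builds the other factorization discussed in the paper, a "fixed" pattern in which one head attends within its current block and the other attends to a few summary positions at the end of every block. The helper name `fixed_sparse_masks` and the block and summary sizes are illustrative assumptions, not the released kernel's exact layout.

```python
import numpy as np

def fixed_sparse_masks(n: int, block: int, c: int):
    """Boolean attention masks for the two heads of a 'fixed' factorization.

    Head 1 attends causally within its own block of length `block`;
    head 2 attends causally to the last `c` positions of every block,
    which act as summary cells carrying information across blocks.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i

    same_block = causal & (i // block == j // block)  # head 1: within-block
    summary = causal & (j % block >= block - c)       # head 2: block summaries
    return same_block, summary

if __name__ == "__main__":
    same_block, summary = fixed_sparse_masks(n=16, block=4, c=1)
    print(same_block.sum(), summary.sum())  # both far sparser than the 16*17/2 dense entries
```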
The Sparse Transformer is trained with mixed-precision arithmetic and is shown to be efficient in both computation and memory. It is evaluated on CIFAR-10, Enwik8 text, and ImageNet 64x64, where it achieves state-of-the-art results, and is also tested on a classical music dataset to demonstrate scaling to very long contexts. Overall, Sparse Transformers match or exceed the performance of standard Transformers while requiring significantly fewer operations.
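The PyTorch sketch below illustrates, under stated assumptions, how mixed-precision training and backward-pass recomputation might be combined for a single residual attention block. It uses standard dense attention and gradient checkpointing as stand-ins; it is not the authors' implementation, their sparse kernels, or their exact residual restructuring.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class RecomputedBlock(nn.Module):
    """Residual self-attention block whose activations are recomputed in the
    backward pass (gradient checkpointing), trading compute for memory."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def _attend(self, x, mask):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + out

    def forward(self, x, mask):
        # checkpoint() discards intermediate activations and re-runs _attend
        # during backward, mirroring the recomputation idea in the paper.
        return checkpoint(self._attend, x, mask, use_reentrant=False)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    block = RecomputedBlock(d_model=64, n_heads=4).to(device)
    opt = torch.optim.Adam(block.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    x = torch.randn(2, 128, 64, device=device, requires_grad=True)
    causal_mask = torch.triu(torch.ones(128, 128, dtype=torch.bool, device=device), diagonal=1)

    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = block(x, causal_mask).pow(2).mean()  # dummy loss for illustration
    scaler.scale(loss).backward()                   # half-precision forward, scaled gradients
    scaler.step(opt)
    scaler.update()
```

Both techniques are orthogonal to the sparse attention masks: recomputation keeps only layer inputs resident during the forward pass, and the gradient scaler prevents half-precision underflow, which together are what make hundreds of layers over very long sequences feasible in memory.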