Generating Long Sequences with Sparse Transformers

23 Apr 2019 | Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
This paper introduces Sparse Transformers, a variant of the Transformer architecture that significantly reduces the computational and memory requirements of processing long sequences. By introducing sparse factorizations of the attention matrix, the complexity of attention is reduced from quadratic, $O(n^2)$, to $O(n \sqrt{n})$. The authors also propose several architectural and initialization improvements, including deeper networks, recomputation of attention matrices during training, and fast attention kernels. These changes enable models with hundreds of layers to handle sequences of tens of thousands of timesteps. The Sparse Transformer is evaluated on density modeling of images, text, and raw audio, achieving state-of-the-art performance. The model demonstrates the ability to learn long-term dependencies and to generate diverse samples, showcasing its potential for handling sequences of up to one million timesteps.
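To make the factorized-attention idea concrete, below is a minimal NumPy sketch of a strided sparsity pattern: one attention head covers a local window of the previous `stride` positions and another covers every `stride`-th earlier position, so with `stride` near $\sqrt{n}$ each query attends to roughly $O(\sqrt{n})$ keys rather than $O(n)$. The function name `strided_attention_masks` and the dense boolean-mask formulation are illustrative assumptions for clarity; the paper realizes these patterns with fast fused attention kernels rather than materialized masks.

```python
import numpy as np

def strided_attention_masks(n: int, stride: int):
    """Boolean masks for the two heads of a strided factorized attention
    pattern over a sequence of length n. Head 1 attends to the previous
    `stride` positions; head 2 attends to every `stride`-th earlier
    position. With stride ~ sqrt(n), each query attends to O(sqrt(n))
    keys instead of the O(n) keys of dense causal attention."""
    i = np.arange(n)[:, None]                    # query positions
    j = np.arange(n)[None, :]                    # key positions
    causal = j <= i                              # autoregressive constraint
    local = causal & (i - j < stride)            # sliding-window head
    strided = causal & ((i - j) % stride == 0)   # strided "summary" head
    return local, strided

if __name__ == "__main__":
    n, stride = 64, 8                            # stride chosen near sqrt(n)
    local, strided = strided_attention_masks(n, stride)
    dense_entries = int(np.tril(np.ones((n, n), dtype=bool)).sum())
    sparse_entries = int((local | strided).sum())
    print(f"dense causal entries: {dense_entries}, factorized entries: {sparse_entries}")
```

Counting the allowed entries of the combined masks versus the full lower-triangular mask shows the reduction in work per layer that underlies the $O(n \sqrt{n})$ complexity claim.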