Rethinking Attention with Performers

19 Nov 2022 | Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
We introduce Performers, a new type of Transformer architecture that can estimate regular (softmax) full-rank attention with provable accuracy, using only linear space and time complexity and without relying on prior assumptions such as sparsity or low-rankness. Performers approximate softmax attention kernels with a novel method called FAVOR+ (Fast Attention Via positive Orthogonal Random features). Beyond softmax, FAVOR+ can efficiently model any kernelizable attention mechanism, enabling accurate comparisons of softmax with other kernels on large-scale tasks that are beyond the reach of regular Transformers. Performers are linear architectures fully compatible with regular Transformers and come with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a variety of tasks, including pixel prediction, text modeling, and protein sequence modeling, demonstrating results competitive with other efficient sparse and dense attention methods. Because FAVOR+ is a robust, low-variance estimator of attention matrices, it can also be applied beyond Transformers to scalable attention mechanisms in areas such as computer vision, reinforcement learning, and combinatorial optimization. Performers achieve nearly optimal speed and memory efficiency, making them suitable for large-scale tasks, and the approach has broad implications for the development of efficient Transformer architectures.
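To make the linear-complexity mechanism concrete, below is a minimal NumPy sketch of bidirectional (non-causal) FAVOR+-style attention, written under simplifying assumptions rather than reproducing the paper's implementation: the function names and the default number of random features are illustrative, and plain Gaussian projections are used where the paper additionally orthogonalizes the random features to reduce estimator variance.

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    # Positive random features for the softmax kernel (FAVOR+ style):
    # phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), with rows w of `projection` drawn from N(0, I_d).
    # In expectation, phi(q)^T phi(k) = exp(q^T k), i.e. the (unnormalized) softmax kernel.
    m = projection.shape[0]
    wx = x @ projection.T                                    # (L, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)   # (L, 1)
    return np.exp(wx - sq_norm) / np.sqrt(m) + eps           # eps keeps the estimate strictly positive

def performer_attention(Q, K, V, num_features=256, seed=0):
    # Linear-complexity approximation of softmax attention:
    # Att(Q, K, V) ~= D^{-1} (phi(Q) (phi(K)^T V)), computed in O(L * m * d)
    # time and memory instead of the O(L^2) of exact attention.
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((num_features, d))      # unstructured Gaussian features (the paper orthogonalizes them)
    scale = d ** -0.25                                       # fold the 1/sqrt(d) softmax scaling into Q and K
    q_prime = softmax_kernel_features(Q * scale, projection) # (L, m)
    k_prime = softmax_kernel_features(K * scale, projection) # (L, m)
    kv = k_prime.T @ V                                       # (m, d_v), computed before touching Q
    numerator = q_prime @ kv                                 # (L, d_v)
    denominator = q_prime @ k_prime.sum(axis=0, keepdims=True).T  # (L, 1), row-normalization term
    return numerator / denominator
```

The key point is the reordering of matrix products: phi(K)^T V is formed first, so sequence length L never appears quadratically, and the approximation error is controlled by the number of random features m rather than by any sparsity or low-rank assumption on the attention matrix.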