19 Nov 2022 | Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
The paper introduces Performers, a new type of Transformer architecture that can estimate regular (softmax) full-rank attention with provable accuracy while using only linear space and time complexity, without relying on priors like sparsity or low-rankness. Performers use a novel Fast Attention Via positive Orthogonal Random features (FAVOR+) approach to approximate softmax attention kernels, which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model other kernelizable attention mechanisms beyond softmax. This representational power is crucial for accurately comparing softmax with other kernels on large-scale tasks and investigating optimal attention kernels. Performers are linear architectures fully compatible with regular Transformers and have strong theoretical guarantees, including unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. The authors test Performers on a wide range of tasks, demonstrating results competitive with other efficient sparse and dense attention methods and showcasing the effectiveness of this attention-learning paradigm.
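To make the linear-complexity claim concrete, here is a minimal NumPy sketch of the core idea behind FAVOR+: map queries and keys through a positive random-feature map that approximates the softmax kernel, then exploit associativity of matrix products so attention costs O(L·m·d) instead of O(L²). The helper names (`positive_random_features`, `favor_plus_attention`) and the use of i.i.d. Gaussian features rather than the paper's orthogonal construction are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def positive_random_features(x, proj, eps=1e-6):
    """Positive random-feature map for the softmax kernel (FAVOR+ idea):
    phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), so E[phi(x) . phi(y)] = exp(x . y)."""
    m = proj.shape[0]
    wx = x @ proj.T                                   # (L, m) projections w^T x
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(wx - sq_norm) / np.sqrt(m) + eps    # strictly positive features

def favor_plus_attention(q, k, v, num_features=256, seed=0):
    """Linear-time approximation of softmax attention.
    q, k, v: arrays of shape (L, d). Returns (L, d).
    Hypothetical sketch: i.i.d. Gaussian features; the paper's orthogonal
    random features would further reduce estimation variance."""
    L, d = q.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((num_features, d))

    scale = d ** -0.25                                # split the usual 1/sqrt(d) scaling
    q_prime = positive_random_features(q * scale, proj)   # (L, m)
    k_prime = positive_random_features(k * scale, proj)   # (L, m)

    # Reassociate: (Q' K'^T) V  ->  Q' (K'^T V), avoiding the L x L matrix.
    kv = k_prime.T @ v                                # (m, d)
    normalizer = q_prime @ k_prime.sum(axis=0)        # (L,) row-wise softmax denominator
    return (q_prime @ kv) / normalizer[:, None]
```

Because the L x L attention matrix is never materialized, memory and compute scale linearly in the sequence length L for a fixed number of random features m, which is the property the summary above refers to.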