This paper explores the potential of stand-alone self-attention as a primary primitive in vision models, rather than just an augmentation to convolutional models. The authors develop a simple local self-attention layer that can be used for both small and large inputs, and demonstrate its effectiveness by building a fully self-attentional model. This model outperforms convolutional baselines on ImageNet classification and COCO object detection tasks while being more parameter-efficient and requiring fewer floating-point operations. Ablation studies show that self-attention is particularly impactful in later layers of the network. The results establish that stand-alone self-attention is a valuable addition to the toolkit of vision practitioners, suggesting future research directions to explore content-based interactions further.
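To make the idea of a local self-attention layer concrete, here is a minimal single-head sketch in NumPy: each output pixel is a softmax-weighted mix of value vectors drawn from a small k×k neighborhood around it, with attention logits computed from query/key dot products. This is a simplified illustration, not the paper's exact layer — the projection matrices `wq`, `wk`, `wv` are random stand-ins, and the relative-position embeddings the paper adds to the logits are omitted here.

```python
import numpy as np

def local_self_attention(x, wq, wk, wv, k=3):
    """Single-head local self-attention over a 2D feature map.

    x          : (H, W, C) input feature map
    wq, wk, wv : (C, C) query/key/value projections (random stand-ins)
    k          : spatial extent of the local neighborhood (odd)
    """
    H, W, C = x.shape
    pad = k // 2
    # Zero-pad so border pixels also see a full k x k neighborhood.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    q = x @ wq                      # one query vector per output pixel
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            # Flatten the k x k neighborhood around pixel (i, j).
            nb = xp[i:i + k, j:j + k].reshape(k * k, C)
            keys = nb @ wk
            vals = nb @ wv
            # Scaled dot-product attention restricted to the neighborhood.
            logits = keys @ q[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())
            w /= w.sum()
            out[i, j] = w @ vals    # softmax-weighted sum of values
    return out

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((5, 5, C))
wq, wk, wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
y = local_self_attention(x, wq, wk, wv, k=3)
print(y.shape)  # (5, 5, 8): output has the same spatial and channel shape
```

Because the receptive field is a fixed local window, parameter count is independent of input size, which is why the same layer works for both small and large inputs.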