This paper explores the potential of stand-alone self-attention as a primary primitive in vision models, rather than just an augmentation to convolutional models. The authors develop a simple local self-attention layer that can be used for both small and large inputs, and demonstrate its effectiveness by building a fully self-attentional model. This model outperforms convolutional baselines on ImageNet classification and COCO object detection tasks while being more parameter-efficient and requiring fewer floating-point operations. Ablation studies show that self-attention is particularly impactful in later layers of the network. The results establish that stand-alone self-attention is a valuable addition to the toolkit of vision practitioners, suggesting future research directions to explore content-based interactions further.
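To make the idea of a local self-attention layer concrete, here is a minimal single-head sketch in NumPy: each output pixel is a softmax-weighted mix of value vectors drawn from a small k×k neighborhood around it, with attention logits computed from query/key dot products. This is a simplified illustration, not the paper's exact layer — the projection matrices `wq`, `wk`, `wv` are random stand-ins, and the relative-position embeddings the paper adds to the logits are omitted here.

```python
import numpy as np

def local_self_attention(x, wq, wk, wv, k=3):
    """Single-head local self-attention over a 2D feature map.

    x          : (H, W, C) input feature map
    wq, wk, wv : (C, C) query/key/value projections (random stand-ins)
    k          : spatial extent of the local neighborhood (odd)
    """
    H, W, C = x.shape
    pad = k // 2
    # Zero-pad so border pixels also see a full k x k neighborhood.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    q = x @ wq                      # one query vector per output pixel
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            # Flatten the k x k neighborhood around pixel (i, j).
            nb = xp[i:i + k, j:j + k].reshape(k * k, C)
            keys = nb @ wk
            vals = nb @ wv
            # Scaled dot-product attention restricted to the neighborhood.
            logits = keys @ q[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())
            w /= w.sum()
            out[i, j] = w @ vals    # softmax-weighted sum of values
    return out

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((5, 5, C))
wq, wk, wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
y = local_self_attention(x, wq, wk, wv, k=3)
print(y.shape)  # (5, 5, 8): output has the same spatial and channel shape
```

Because the receptive field is a fixed local window, parameter count is independent of input size, which is why the same layer works for both small and large inputs.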