Stand-Alone Self-Attention in Vision Models

13 Jun 2019 | Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
This paper presents a study of stand-alone self-attention in vision models. The authors demonstrate that self-attention can serve as a stand-alone primitive for vision tasks rather than merely augmenting convolutions, matching or exceeding convolutional baselines while using less computation. Replacing all spatial convolutions in a ResNet with local self-attention layers yields a model that outperforms the ImageNet classification baseline while using 12% fewer FLOPs and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while using 39% fewer FLOPs and 34% fewer parameters. Ablations show that self-attention is especially effective in the later layers of the network and that relative positional embeddings significantly improve performance. The authors conclude that stand-alone self-attention is an important addition to the vision practitioner's toolbox and a viable alternative to convolutional models, and that combining convolution in the early layers with self-attention in the later layers can lead to the best performance. The study also highlights the importance of positional information in attention mechanisms and the effectiveness of spatially-aware attention in the stem layer.
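
The core building block is a local self-attention layer: each output pixel attends over a k x k neighborhood of the input, with queries, keys, and values produced by 1x1 convolutions and attention logits augmented by relative positional embeddings factorized into row and column offsets. Below is a minimal single-head sketch in PyTorch, assuming stride 1 and an odd window size; the class name, parameter shapes, and the omission of multi-head grouping and the spatially-aware stem are simplifications for illustration, not the authors' implementation.

```python
# Minimal single-head sketch of a local self-attention layer with relative
# positional embeddings (illustrative, not the paper's reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttention2d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 7):
        super().__init__()
        assert kernel_size % 2 == 1 and out_channels % 2 == 0
        self.k = kernel_size
        # Queries, keys, and values come from 1x1 convolutions (per-pixel linear maps).
        self.query = nn.Conv2d(in_channels, out_channels, 1)
        self.key = nn.Conv2d(in_channels, out_channels, 1)
        self.value = nn.Conv2d(in_channels, out_channels, 1)
        # Relative positional embeddings, factorized into row and column offsets;
        # each offset direction contributes half of the channel dimension.
        self.rel_rows = nn.Parameter(torch.randn(out_channels // 2, kernel_size) * 0.02)
        self.rel_cols = nn.Parameter(torch.randn(out_channels // 2, kernel_size) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        pad = self.k // 2
        q = self.query(x)                                  # (b, c, h, w)
        k = self.key(F.pad(x, [pad] * 4))                  # zero-pad so every pixel has a full window
        v = self.value(F.pad(x, [pad] * 4))
        # Gather the k x k neighborhood around each pixel: (b, c, k*k, h*w).
        k = F.unfold(k, self.k).view(b, -1, self.k * self.k, h * w)
        v = F.unfold(v, self.k).view(b, -1, self.k * self.k, h * w)
        # Relative position table of shape (c, k*k): the entry for window offset (i, j)
        # concatenates the row-i embedding and the column-j embedding.
        rel = torch.cat([
            self.rel_rows.unsqueeze(2).expand(-1, -1, self.k),  # varies with row offset
            self.rel_cols.unsqueeze(1).expand(-1, self.k, -1),  # varies with column offset
        ], dim=0).reshape(1, -1, self.k * self.k, 1)
        # Attention logits: query dotted with (key + relative embedding) at each offset.
        q = q.view(b, -1, 1, h * w)
        logits = (q * (k + rel)).sum(dim=1)                # (b, k*k, h*w)
        weights = logits.softmax(dim=1)
        # Output is the attention-weighted sum of values over the local window.
        out = (weights.unsqueeze(1) * v).sum(dim=2)        # (b, c, h*w)
        return out.view(b, -1, h, w)
```

Used this way, a stand-alone attention ResNet is obtained by swapping the 3x3 spatial convolution in each bottleneck block for such a layer while keeping the surrounding 1x1 convolutions, which is how the paper realizes its FLOPs and parameter savings.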