9 Sep 2020 | Irwan Bello, Barret Zoph, Ashish Vaswani, Quoc V. Le, Jonathon Shlens
Attention Augmented Convolutional Networks (AA-Conv) introduce a novel two-dimensional relative self-attention mechanism to enhance image classification and object detection. The method combines convolutional and self-attention operations: self-attention captures long-range dependencies while relative position embeddings maintain translation equivariance.

The self-attention mechanism is integrated into convolutional networks by concatenating convolutional feature maps with self-attentional feature maps, enabling the model to capture both local and global information. This approach yields consistent improvements in image classification on ImageNet and object detection on COCO across various models and scales, at parameter counts similar to traditional convolutional networks.

The method outperforms existing attention mechanisms such as Squeeze-and-Excitation, achieving a 1.3% top-1 accuracy improvement on ImageNet and a 1.4 mAP improvement on COCO. Self-attention is also shown to be effective in fully attentional models, which perform comparably to fully convolutional models on ImageNet.

The mechanism is implemented with relative position embeddings to maintain translation equivariance and is tested on architectures including ResNets, MnasNet, and RetinaNet. The results demonstrate that Attention Augmentation provides a competitive alternative to traditional convolutional networks, offering improved performance with minimal additional computational cost.
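A minimal sketch of the core concatenation idea, assuming PyTorch; the class name AugmentedConv2d and the dk/dv/num_heads arguments are illustrative, not the authors' reference API. Relative position logits are omitted here and sketched separately below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedConv2d(nn.Module):
    """Concatenates a standard convolution's output with multi-head
    self-attention feature maps (content attention only)."""
    def __init__(self, in_ch, out_ch, kernel_size, dk, dv, num_heads):
        super().__init__()
        assert dk % num_heads == 0 and dv % num_heads == 0
        self.dk, self.dv, self.num_heads = dk, dv, num_heads
        # The conv branch produces out_ch - dv channels so that the
        # concatenated output has out_ch channels in total.
        self.conv = nn.Conv2d(in_ch, out_ch - dv, kernel_size,
                              padding=kernel_size // 2)
        # A single 1x1 conv computes queries, keys, and values jointly.
        self.qkv = nn.Conv2d(in_ch, 2 * dk + dv, kernel_size=1)
        self.attn_out = nn.Conv2d(dv, dv, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        conv_out = self.conv(x)                        # local features

        q, k, v = self.qkv(x).split([self.dk, self.dk, self.dv], dim=1)
        # Flatten spatial dims and split channels across heads.
        def heads(t, d):
            return t.reshape(B, self.num_heads, d // self.num_heads, H * W)
        q, k, v = heads(q, self.dk), heads(k, self.dk), heads(v, self.dv)

        q = q * (self.dk // self.num_heads) ** -0.5    # scaled dot-product
        logits = torch.einsum('bhdi,bhdj->bhij', q, k)
        weights = F.softmax(logits, dim=-1)
        attn = torch.einsum('bhij,bhdj->bhdi', weights, v)
        attn_out = self.attn_out(attn.reshape(B, self.dv, H, W))

        # Core idea: concatenate convolutional and attentional maps.
        return torch.cat([conv_out, attn_out], dim=1)
```

As a usage example, `AugmentedConv2d(64, 128, kernel_size=3, dk=32, dv=32, num_heads=8)` maps a (2, 64, 32, 32) input to (2, 128, 32, 32), with 96 convolutional and 32 attentional output channels; keeping dk and dv small relative to out_ch is what keeps the parameter count close to the original convolution.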
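The relative position term can be sketched as follows, again assuming PyTorch. The direct embedding gather below is chosen for clarity over a memory-efficient rearrangement; all names are illustrative. It computes the query-times-relative-embedding logits that would be added to the content logits before the softmax.

```python
import torch
import torch.nn as nn

class RelativeLogits2d(nn.Module):
    def __init__(self, height, width, d_head):
        super().__init__()
        # One learned embedding per relative offset along each axis,
        # shared across attention heads (an assumption of this sketch).
        self.rel_h = nn.Parameter(torch.randn(2 * height - 1, d_head) * d_head ** -0.5)
        self.rel_w = nn.Parameter(torch.randn(2 * width - 1, d_head) * d_head ** -0.5)
        self.height, self.width = height, width

    def forward(self, q):
        # q: (B, heads, d_head, H * W), as in the sketch above.
        B, heads, d, _ = q.shape
        H, W = self.height, self.width
        q = q.reshape(B, heads, d, H, W)

        # Relative offsets between all pairs of rows / columns, shifted
        # by H - 1 / W - 1 so they index into the embedding tables.
        rows = torch.arange(H)
        cols = torch.arange(W)
        emb_h = self.rel_h[rows[None, :] - rows[:, None] + H - 1]  # (H, H, d)
        emb_w = self.rel_w[cols[None, :] - cols[:, None] + W - 1]  # (W, W, d)

        # Logits depend only on relative offsets, never on absolute
        # position, which is what preserves translation equivariance.
        logits_h = torch.einsum('bndxy,xud->bnxyu', q, emb_h)  # row term
        logits_w = torch.einsum('bndxy,yvd->bnxyv', q, emb_w)  # column term
        logits = logits_h[..., :, None] + logits_w[..., None, :]
        return logits.reshape(B, heads, H * W, H * W)
```

In the attention sketch above, these logits would be added to the content logits, e.g. `weights = F.softmax(logits + rel_logits, dim=-1)`; since q is already scaled there, both terms share the same scaling.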