9 Sep 2020 | Irwan Bello, Barret Zoph, Ashish Vaswani, Quoc V. Le, Jonathon Shlens
Attention Augmented Convolutional Networks (AA-Conv) introduce a novel two-dimensional relative self-attention mechanism to enhance image classification and object detection. The method combines convolutional and self-attention operations: self-attention captures long-range dependencies while relative position embeddings maintain translation equivariance.

The self-attention mechanism is integrated into convolutional networks by concatenating convolutional feature maps with self-attentional feature maps, enabling the model to capture both local and global information. This approach yields consistent improvements in image classification on ImageNet and object detection on COCO across various models and scales, at parameter counts similar to traditional convolutional networks.

The method outperforms existing attention mechanisms such as Squeeze-and-Excitation, achieving a 1.3% top-1 accuracy improvement on ImageNet and a 1.4 mAP improvement on COCO. Self-attention is also shown to be effective in fully attentional models, which perform comparably to fully convolutional models on ImageNet.

The mechanism is implemented with relative position embeddings to maintain translation equivariance and is tested on architectures including ResNets, MnasNet, and RetinaNet. The results demonstrate that Attention Augmentation provides a competitive alternative to traditional convolutional networks, offering improved performance with minimal additional computational cost.
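A minimal sketch of the core concatenation idea, assuming PyTorch; the class name AugmentedConv2d and the dk/dv/num_heads arguments are illustrative, not the authors' reference API. Relative position logits are omitted here and sketched separately below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedConv2d(nn.Module):
    """Concatenates a standard convolution's output with multi-head
    self-attention feature maps (content attention only)."""
    def __init__(self, in_ch, out_ch, kernel_size, dk, dv, num_heads):
        super().__init__()
        assert dk % num_heads == 0 and dv % num_heads == 0
        self.dk, self.dv, self.num_heads = dk, dv, num_heads
        # The conv branch produces out_ch - dv channels so that the
        # concatenated output has out_ch channels in total.
        self.conv = nn.Conv2d(in_ch, out_ch - dv, kernel_size,
                              padding=kernel_size // 2)
        # A single 1x1 conv computes queries, keys, and values jointly.
        self.qkv = nn.Conv2d(in_ch, 2 * dk + dv, kernel_size=1)
        self.attn_out = nn.Conv2d(dv, dv, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        conv_out = self.conv(x)                        # local features

        q, k, v = self.qkv(x).split([self.dk, self.dk, self.dv], dim=1)
        # Flatten spatial dims and split channels across heads.
        def heads(t, d):
            return t.reshape(B, self.num_heads, d // self.num_heads, H * W)
        q, k, v = heads(q, self.dk), heads(k, self.dk), heads(v, self.dv)

        q = q * (self.dk // self.num_heads) ** -0.5    # scaled dot-product
        logits = torch.einsum('bhdi,bhdj->bhij', q, k)
        weights = F.softmax(logits, dim=-1)
        attn = torch.einsum('bhij,bhdj->bhdi', weights, v)
        attn_out = self.attn_out(attn.reshape(B, self.dv, H, W))

        # Core idea: concatenate convolutional and attentional maps.
        return torch.cat([conv_out, attn_out], dim=1)
```

As a usage example, `AugmentedConv2d(64, 128, kernel_size=3, dk=32, dv=32, num_heads=8)` maps a (2, 64, 32, 32) input to (2, 128, 32, 32), with 96 convolutional and 32 attentional output channels; keeping dk and dv small relative to out_ch is what keeps the parameter count close to the original convolution.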
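The relative position term can be sketched as follows, again assuming PyTorch. The direct embedding gather below is chosen for clarity over a memory-efficient rearrangement; all names are illustrative. It computes the query-times-relative-embedding logits that would be added to the content logits before the softmax.

```python
import torch
import torch.nn as nn

class RelativeLogits2d(nn.Module):
    def __init__(self, height, width, d_head):
        super().__init__()
        # One learned embedding per relative offset along each axis,
        # shared across attention heads (an assumption of this sketch).
        self.rel_h = nn.Parameter(torch.randn(2 * height - 1, d_head) * d_head ** -0.5)
        self.rel_w = nn.Parameter(torch.randn(2 * width - 1, d_head) * d_head ** -0.5)
        self.height, self.width = height, width

    def forward(self, q):
        # q: (B, heads, d_head, H * W), as in the sketch above.
        B, heads, d, _ = q.shape
        H, W = self.height, self.width
        q = q.reshape(B, heads, d, H, W)

        # Relative offsets between all pairs of rows / columns, shifted
        # by H - 1 / W - 1 so they index into the embedding tables.
        rows = torch.arange(H)
        cols = torch.arange(W)
        emb_h = self.rel_h[rows[None, :] - rows[:, None] + H - 1]  # (H, H, d)
        emb_w = self.rel_w[cols[None, :] - cols[:, None] + W - 1]  # (W, W, d)

        # Logits depend only on relative offsets, never on absolute
        # position, which is what preserves translation equivariance.
        logits_h = torch.einsum('bndxy,xud->bnxyu', q, emb_h)  # row term
        logits_w = torch.einsum('bndxy,yvd->bnxyv', q, emb_w)  # column term
        logits = logits_h[..., :, None] + logits_w[..., None, :]
        return logits.reshape(B, heads, H * W, H * W)
```

In the attention sketch above, these logits would be added to the content logits, e.g. `weights = F.softmax(logits + rel_logits, dim=-1)`; since q is already scaled there, both terms share the same scaling.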