Bottleneck Transformers for Visual Recognition

2 Aug 2021 | Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
BoTNet is a backbone architecture that incorporates self-attention for multiple computer vision tasks, including image classification, object detection, and instance segmentation. By replacing the spatial convolutions in the final three bottleneck blocks of a ResNet with global self-attention, BoTNet significantly improves performance on instance segmentation and object detection while reducing parameter count and adding only minimal latency overhead. The authors show that ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks, highlighting the equivalence between the two structures. BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO instance segmentation benchmark, surpassing previous state-of-the-art results. Additionally, a simple adaptation of BoTNet for image classification reaches 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in compute time than EfficientNet models. The paper also discusses the challenges and design choices of using self-attention in vision tasks, and provides a comprehensive evaluation of BoTNet's performance across a range of experiments.
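To make the core design concrete, here is a minimal PyTorch sketch of the BoT block idea: a ResNet bottleneck in which the 3x3 spatial convolution is replaced by global multi-head self-attention over the feature map. This is an illustrative sketch, not the authors' reference implementation; in particular, the paper's relative position encodings are omitted here, and the class name, layer choices, and hyperparameters below are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class BoTBlock(nn.Module):
    """Illustrative bottleneck block with self-attention in place of the 3x3 conv."""

    def __init__(self, in_channels: int, bottleneck_channels: int, heads: int = 4):
        super().__init__()
        # 1x1 conv reduces channels, as in a standard ResNet bottleneck.
        self.reduce = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        # Global all-to-all self-attention stands in for the 3x3 convolution.
        # (The paper also adds relative position encodings, omitted here.)
        self.attn = nn.MultiheadAttention(bottleneck_channels, heads, batch_first=True)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        # 1x1 conv expands channels back; a residual connection closes the block.
        self.expand = nn.Conv2d(bottleneck_channels, in_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.reduce(x)))
        # Flatten the H x W grid into a sequence of HW tokens for attention.
        b, c, h, w = out.shape
        seq = out.flatten(2).transpose(1, 2)   # (B, HW, C)
        seq, _ = self.attn(seq, seq, seq)      # global self-attention
        out = seq.transpose(1, 2).reshape(b, c, h, w)
        out = self.relu(self.bn2(out))
        out = self.bn3(self.expand(out))
        return self.relu(out + identity)


# Example: apply one such block to a stride-32 (c5-stage) feature map,
# the stage where BoTNet swaps in self-attention.
feats = torch.randn(2, 2048, 14, 14)
block = BoTBlock(in_channels=2048, bottleneck_channels=512)
print(block(feats).shape)  # torch.Size([2, 2048, 14, 14])
```

Replacing convolution only in the final stage is the key efficiency choice: self-attention costs grow quadratically with the number of spatial positions, so applying it on the smallest (stride-32) feature maps keeps the overhead low while still giving every position a global receptive field.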