BoTNet is a backbone architecture that incorporates self-attention into ResNet bottleneck blocks for image classification, object detection, and instance segmentation. By replacing the 3x3 spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and making no other changes, BoTNet improves instance segmentation and object detection significantly while reducing parameters and keeping the latency overhead minimal. Because a bottleneck block with self-attention can be viewed as a Transformer block, the design places ResNets and Transformers in a common vocabulary and demonstrates the effectiveness of self-attention in vision tasks (a minimal sketch of such a block follows this summary).

Using the Mask R-CNN framework, BoTNet reaches 44.4% Mask AP and 49.7% Box AP on the COCO instance segmentation benchmark, surpassing previously published results. A BoTNet-based model also reaches 84.7% top-1 accuracy on ImageNet while being up to 1.64x faster than comparable EfficientNet models on TPU-v3 hardware. The approach is simple and effective, and it serves as a strong baseline for future research on self-attention models for vision.

The paper further evaluates BoTNet under different training schedules and image resolutions and compares it with related architectures such as Non-Local Neural Networks. The gains grow with larger images and longer training, indicating that global self-attention effectively captures long-range dependencies. The authors conclude that self-attention is a promising primitive for vision and highlight the potential of hybrid models that combine convolutions with self-attention.
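To make the block structure concrete, below is a minimal PyTorch sketch of a bottleneck block whose 3x3 convolution is replaced by global multi-head self-attention with factorized 2D relative position encodings, in the spirit of BoTNet's MHSA layer. The class names (MHSA2d, BoTBlock), the head count, and the fixed feature-map size are illustrative assumptions rather than the paper's reference implementation; strided blocks and the projection shortcut are omitted.

```python
import torch
import torch.nn as nn


class MHSA2d(nn.Module):
    """Global multi-head self-attention over a 2D feature map with factorized
    relative position encodings. A sketch in the spirit of BoTNet's MHSA layer;
    names and initialization details are illustrative, not the paper's code."""

    def __init__(self, dim, heads=4, h=14, w=14):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.hw = heads, (h, w)
        self.scale = (dim // heads) ** -0.5
        # A 1x1 convolution produces queries, keys, and values; no spatial conv.
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        # One learned encoding per row and per column; their broadcast sum
        # yields a position encoding for every spatial location.
        self.rel_h = nn.Parameter(torch.randn(1, heads, dim // heads, h, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(1, heads, dim // heads, 1, w) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        assert (h, w) == self.hw, "sketch assumes a fixed feature-map size"
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # (b, c, h, w) -> (b, heads, head_dim, positions)
        split = lambda t: t.reshape(b, self.heads, c // self.heads, h * w)
        q, k, v = split(q), split(k), split(v)
        # Content-content logits: every position attends to every other one.
        content = torch.einsum('bhdi,bhdj->bhij', q, k)
        # Content-position logits against the summed relative encodings.
        pos = (self.rel_h + self.rel_w).reshape(1, self.heads, c // self.heads, h * w)
        position = torch.einsum('bhdi,bhdj->bhij', q, pos.expand(b, -1, -1, -1))
        attn = ((content + position) * self.scale).softmax(dim=-1)
        out = torch.einsum('bhij,bhdj->bhdi', attn, v)
        return out.reshape(b, c, h, w)


class BoTBlock(nn.Module):
    """ResNet bottleneck whose 3x3 spatial convolution is swapped for MHSA.
    Identity shortcut only; strided variants are omitted for brevity."""

    def __init__(self, in_dim=2048, mid_dim=512, heads=4, h=14, w=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, mid_dim, 1, bias=False),
            nn.BatchNorm2d(mid_dim), nn.ReLU(inplace=True),
            MHSA2d(mid_dim, heads=heads, h=h, w=w),  # replaces the 3x3 conv
            nn.BatchNorm2d(mid_dim), nn.ReLU(inplace=True),
            nn.Conv2d(mid_dim, in_dim, 1, bias=False),
            nn.BatchNorm2d(in_dim),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.net(x))  # residual connection, as in ResNet


# Example: one BoT block applied to a last-stage-sized feature map.
feats = torch.randn(2, 2048, 14, 14)
print(BoTBlock(h=14, w=14)(feats).shape)  # torch.Size([2, 2048, 14, 14])
```

Because only the final stage's three blocks are swapped, the quadratic cost of global attention is paid only on the smallest feature maps, which is why the latency overhead of the hybrid design stays low.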