15 Sep 2021 | Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
CoAtNet: Marrying Convolution and Attention for All Data Sizes
**Abstract:**
Transformers have attracted increasing interest in computer vision, but they still lag behind state-of-the-art convolutional networks (ConvNets) in generalization. This work introduces CoAtNets, a family of hybrid models that combine the strengths of both ConvNets and Transformers. CoAtNets achieve this by unifying depthwise convolution and self-attention through relative attention, and by vertically stacking convolution and attention layers. Experiments show that CoAtNets outperform existing models under various resource constraints, achieving 86.0% top-1 accuracy on ImageNet-1K with no extra data, 88.56% with ImageNet-21K pre-training (13M images), and 90.88% with JFT-3B pre-training (3B images), setting new state-of-the-art results.
**Introduction:**
ConvNets have dominated computer vision since AlexNet, while Transformers have shown impressive performance in natural language processing. Vision Transformers (ViT) can match state-of-the-art ConvNets on ImageNet when pre-trained on very large datasets, but still fall behind in low-data regimes. This work systematically studies how to combine convolution and attention along two axes: generalization and model capacity. CoAtNets leverage the inductive biases of ConvNets for better generalization and the capacity of Transformers for better scaling with data.
**Model:**
CoAtNets unify depthwise convolution and self-attention through relative attention, in which the attention logits are the sum of an input-dependent term and a convolution-like, input-independent relative bias. The model uses a vertical layout that stacks convolution stages before attention stages to balance generalization and capacity. Experiments show that CoAtNets outperform ViT variants and match or surpass state-of-the-art ConvNets across data sizes and resource constraints.
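To make the relative-attention idea concrete, the sketch below shows a minimal 1-D, single-head version in PyTorch: the input-dependent dot-product logits are summed with an input-independent bias indexed by the relative offset i − j (the convolution-like term) before the softmax. The function and parameter names (`relative_attention_1d`, `w_q`, `w_k`, `w_v`, `rel_bias`) are illustrative, and the query/key/value projections and scaling are simplifying assumptions; the actual CoAtNet blocks use 2-D relative positions and multi-head attention inside a deeper stage layout.

```python
import torch
import torch.nn.functional as F

def relative_attention_1d(x, w_q, w_k, w_v, rel_bias):
    """Single-head relative attention over a 1-D sequence (illustrative sketch).

    x:        (L, d)    input tokens
    w_q/k/v:  (d, d)    projection matrices (hypothetical names)
    rel_bias: (2L - 1,) learnable bias indexed by the relative offset i - j
    """
    L, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Input-dependent term: scaled dot-product attention logits.
    logits = (q @ k.T) / d ** 0.5                # (L, L)

    # Input-independent term: gather w_{i-j} into an (L, L) bias matrix.
    idx = torch.arange(L)
    rel = idx[:, None] - idx[None, :] + (L - 1)  # offsets shifted to [0, 2L-2]
    logits = logits + rel_bias[rel]

    # Softmax over j, then weighted sum of the values.
    attn = F.softmax(logits, dim=-1)
    return attn @ v

# Toy usage: 8 tokens with 16-dimensional embeddings.
x = torch.randn(8, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
rel_bias = torch.zeros(2 * 8 - 1, requires_grad=True)
out = relative_attention_1d(x, w_q, w_k, w_v, rel_bias)
print(out.shape)  # torch.Size([8, 16])
```

Because `rel_bias` depends only on the offset i − j, it acts like a static, translation-equivariant convolution kernel, while the softmax over the combined logits preserves the adaptive, input-dependent weighting of self-attention.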
**Related Work:**
The paper reviews prior work on combining convolution and attention, including relative attention and Transformer backbones for vision. CoAtNet differs in systematically studying how to merge the two operations into a single block and how to stack them vertically, which leads to better performance.
**Experiments:**
CoAtNets are evaluated on ImageNet-1K, ImageNet-21K, and JFT. Results show that CoAtNets achieve state-of-the-art accuracy at each data scale with comparable or lower computation and parameter budgets, demonstrating their efficiency and effectiveness.
**Conclusion:**
CoAtNets effectively combine the strengths of ConvNets and Transformers, achieving superior performance across data sizes and resource constraints. The authors note that the approach should extend to broader tasks such as object detection and semantic segmentation, which is left for future work.