*Going deeper with Image Transformers* — 7 Apr 2021 | Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou
This paper explores the optimization and architectural design of deep transformer networks for image classification, aiming to improve their accuracy and efficiency. The authors present two main contributions: LayerScale and CaiT (Class-Attention in Image Transformers).
1. **LayerScale**: This method multiplies the output of each residual block by a learnable diagonal matrix initialized with small values. Compared with a single learnable scalar per block, the per-channel weights give the optimizer more degrees of freedom, and the near-zero initialization keeps each block close to the identity early in training. This stabilizes the training dynamics and lets deeper, higher-capacity image transformers converge without early saturation (a minimal sketch follows this list).
2. **CaiT Architecture**: This architecture splits the transformer layers into two stages: self-attention and class-attention. The self-attention stage processes the patch tokens without any class token; the subsequent class-attention stage inserts a class embedding that attends to the (now frozen) patch tokens to extract the information used for classification. This separation resolves the contradictory objectives placed on the class token in a standard ViT, which must simultaneously guide the self-attention between patches and summarize the content for the classifier (see the second sketch after this list).
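To make the LayerScale idea concrete, here is a minimal PyTorch sketch. The names (`LayerScale`, `Block`, `init_value`) and the use of PyTorch's stock `nn.MultiheadAttention` are illustrative choices, not the paper's code; the diagonal matrix is stored as a per-channel vector, since multiplying by a diagonal matrix is the same as an element-wise product per channel.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel scaling of a residual branch (sketch of LayerScale).

    Stores the learnable diagonal matrix as a vector of size `dim`.
    """
    def __init__(self, dim, init_value=1e-4):
        super().__init__()
        # Small initialization keeps the block near the identity at the
        # start of training; 1e-4 is a representative value, not the
        # paper's exact per-depth choice.
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x  # broadcast over (batch, tokens, dim)

class Block(nn.Module):
    """Transformer residual block with LayerScale on both branches:
    x = x + gamma1 * SA(norm(x));  x = x + gamma2 * FFN(norm(x))."""
    def __init__(self, dim, num_heads=8, init_value=1e-4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ls1 = LayerScale(dim, init_value)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ls2 = LayerScale(dim, init_value)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.ls1(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.ls2(self.mlp(self.norm2(x)))
        return x
```

The paper chooses smaller initial values for deeper models (on the order of 0.1 for shallow networks down to much smaller values for the deepest ones), which is why `init_value` is exposed as a parameter here.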
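And a sketch of the class-attention stage, again with hypothetical names and PyTorch's stock attention standing in for the paper's own implementation. The key point it illustrates: only the class token produces queries, so the patch tokens are read but never updated in this stage.

```python
import torch
import torch.nn as nn

class ClassAttentionLayer(nn.Module):
    """One class-attention layer: the query comes from the class token
    only; keys/values come from the class token concatenated with the
    patch tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_token, patches):
        # cls_token: (B, 1, D), patches: (B, N, D)
        z = self.norm(torch.cat([cls_token, patches], dim=1))  # (B, N+1, D)
        out, _ = self.attn(z[:, :1], z, z, need_weights=False)
        return cls_token + out  # residual update of the class token only

def cait_forward(sa_blocks, ca_layers, patches, cls_token):
    """Two-stage forward pass (hypothetical driver, not the paper's code)."""
    for blk in sa_blocks:        # stage 1: patch self-attention,
        patches = blk(patches)   #          no class token involved
    for ca in ca_layers:         # stage 2: class-attention updates only
        cls_token = ca(cls_token, patches)  # the class embedding
    return cls_token             # fed to the classifier head
```

Because the patch tokens are frozen in stage 2, the class-attention layers act purely as a readout, which is what separates the two objectives described above.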
The authors experimentally demonstrate the effectiveness of these methods:
- **LayerScale** significantly improves the convergence and accuracy of deeper image transformers; combined with the class-attention stage, the resulting models reach 86.5% top-1 accuracy on ImageNet with no external data.
- **CaiT** models establish new state-of-the-art results on ImageNet-Real and ImageNet-V2 (matched frequency) with fewer FLOPs and parameters than competing models.
The paper also discusses related work, visualizes the learned attention maps, and provides ablation studies supporting the proposed methods. Overall, the work shows that transformer models can offer competitive alternatives to convolutional neural networks in terms of the accuracy/complexity trade-off.