Going deeper with Image Transformers

7 Apr 2021 | Hugo Touvron*,†, Matthieu Cord†, Alexandre Sablayrolles*, Gabriel Synnaeve*, Hervé Jégou*
This paper studies deeper image transformers, focusing on making them easier to train and more accurate. The authors make two key contributions: LayerScale and class-attention layers, which together form the CaiT (Class-Attention in Image Transformers) architecture.

LayerScale multiplies the output of each residual block by a learnable diagonal matrix, equivalent to a per-channel scaling vector, initialized close to zero. This stabilizes the training dynamics of deep transformers, which otherwise suffer from instability and require careful initialization and optimization, and it allows substantially deeper models to converge with better accuracy. A sketch is given below.

Class-attention layers separate the two roles a conventional vision transformer's class token must play: mediating self-attention among patches and summarizing the image for the classifier. In CaiT, the self-attention stage processes patch embeddings only; a later class-attention stage then lets the class embedding attend to the frozen patch embeddings, enabling more effective learning of the class representation. A sketch follows the LayerScale example.

The models are evaluated on ImageNet, ImageNet-Real, and ImageNet-V2, as well as on transfer-learning benchmarks, and reach state-of-the-art results with fewer parameters and FLOPs than competing approaches. The best model achieves 86.5% top-1 accuracy on ImageNet without additional training data. The experiments show that LayerScale significantly improves convergence and accuracy as depth grows, while class-attention further improves how the class embedding is learned.
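The following is a minimal PyTorch sketch of LayerScale, assuming the diagonal matrix is stored as a per-channel vector (elementwise scaling is equivalent to multiplying by a diagonal matrix). The class name, the `block` argument, and the `init_value` default are illustrative, not the paper's official code; the paper initializes the scaling to small values that shrink with depth.

```python
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    """Residual block with LayerScale (illustrative sketch).

    The inner block's output is multiplied elementwise by a learnable
    per-channel vector gamma, i.e. x + diag(gamma) * Block(LayerNorm(x)).
    Initializing gamma near zero makes each layer start close to the
    identity, which stabilizes the training of very deep transformers.
    """
    def __init__(self, dim, block, init_value=1e-4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.block = block  # e.g. a self-attention or MLP sub-block
        # gamma plays the role of the diagonal matrix diag(l_1, ..., l_d)
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return x + self.gamma * self.block(self.norm(x))
```

Each residual branch (self-attention and feed-forward alike) gets its own scaling vector, so the optimizer can adapt the contribution of every layer independently.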
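Below is a minimal sketch of a class-attention layer following the paper's description: the query is computed from the class token alone, while keys and values come from the class token concatenated with the patch embeddings, so only the class embedding is updated. Head counts, naming, and the absence of dropout/bias options are simplifying assumptions, not the official CaiT implementation.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Class-attention (illustrative sketch): the class token queries the
    patch embeddings, and only the class token is updated."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls, x_patches):
        # x_cls: (B, 1, D) class token; x_patches: (B, N, D) patch tokens
        B, N, D = x_patches.shape
        h, d = self.num_heads, D // self.num_heads
        z = torch.cat([x_cls, x_patches], dim=1)               # (B, 1+N, D)
        q = self.q(x_cls).reshape(B, 1, h, d).transpose(1, 2)  # (B, h, 1, d)
        kv = self.kv(z).reshape(B, 1 + N, 2, h, d).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                                    # (B, h, 1+N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, h, 1, 1+N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, D)      # new class token
        return self.proj(out)
```

Because the patch embeddings are never rewritten in this stage, the class-attention layers are cheap: attention is computed from a single query rather than from all tokens.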