15 Jan 2021 | Hugo Touvron*,† Matthieu Cord† Matthijs Douze* Francisco Massa* Alexandre Sablayrolles* Hervé Jégou*
This paper introduces DeiT (Data-efficient image Transformers), an approach to training image transformers that does not require large pre-training datasets. The authors show that their method reaches competitive accuracy on ImageNet using a single 8-GPU node and two to three days of training. They also introduce a teacher-student strategy specific to transformers: a dedicated distillation token lets the student learn from the teacher through the attention mechanism. This strategy outperforms traditional distillation, especially when the teacher is a convolutional neural network (CNN); the results indicate that image transformers learn more effectively from a CNN teacher than from another transformer. The models pre-trained on ImageNet also transfer well to downstream tasks such as fine-grained classification. The paper provides an open-source implementation and discusses the key hyperparameters and training techniques used.
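To make the distillation mechanism concrete, here is a minimal sketch of the hard-label distillation objective associated with the distillation token, assuming a PyTorch-style student whose forward pass returns separate logits for the class token and the distillation token. The class name, the frozen teacher, and the 50/50 weighting follow the paper's description, but this is an illustrative sketch rather than the authors' actual API (their official implementation is open-sourced).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardDistillationLoss(nn.Module):
    """Combine standard cross-entropy on the class-token head with a
    hard-label distillation loss on the distillation-token head.
    `teacher` is assumed to be a frozen, pre-trained CNN (e.g. a RegNet)."""

    def __init__(self, teacher: nn.Module):
        super().__init__()
        self.teacher = teacher.eval()

    def forward(self, inputs, cls_logits, dist_logits, labels):
        # Supervised loss on the class-token output against the true labels.
        ce = F.cross_entropy(cls_logits, labels)
        # Hard pseudo-labels from the teacher; no gradients flow to the teacher.
        with torch.no_grad():
            teacher_labels = self.teacher(inputs).argmax(dim=1)
        # The distillation-token output is trained to match the teacher's decision.
        dist = F.cross_entropy(dist_logits, teacher_labels)
        return 0.5 * ce + 0.5 * dist
```

At inference time, the paper fuses the two heads (class token and distillation token) by averaging their softmax outputs, so the sketch above only covers the training objective.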