Training data-efficient image transformers & distillation through attention

15 Jan 2021 | Hugo Touvron*,†, Matthieu Cord†, Matthijs Douze*, Francisco Massa*, Alexandre Sablayrolles*, Hervé Jégou*
This paper introduces DeiT, a data-efficient image transformer that achieves competitive performance on ImageNet without requiring large amounts of training data. The model is trained on a single computer in less than three days and reaches 83.1% top-1 accuracy on ImageNet with no external data. The key contribution is a teacher-student distillation strategy specific to transformers: a learnable distillation token is added to the input sequence, letting the student learn from the teacher through the attention mechanism. This approach outperforms conventional distillation and is particularly effective when a convolutional neural network serves as the teacher. The model also performs well on transfer learning tasks and other downstream applications. The paper further examines the accuracy/efficiency trade-off between vision transformers and convolutional networks, showing that DeiT is competitive at a comparable parameter count and throughput. The results demonstrate that vision transformers can be trained efficiently without massive datasets and that distillation substantially improves their performance. An open-source implementation of the method is provided.
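To make the distillation-token idea concrete, below is a minimal PyTorch sketch: a learnable distillation token is appended alongside the class token, attends to the image patches through the transformer, and is supervised by the teacher's hard predictions while the class token is supervised by the ground-truth label. The names (`DistilledViT`, `hard_distillation_loss`) and the toy dimensions are illustrative assumptions, not the paper's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViT(nn.Module):
    """Sketch of a ViT with an extra distillation token (DeiT-style)."""
    def __init__(self, num_patches=196, dim=192, depth=12, heads=3, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))  # extra learnable token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)       # classifier on the class token
        self.head_dist = nn.Linear(dim, num_classes)  # classifier on the distillation token

    def forward(self, patch_embeddings):  # patch_embeddings: (B, num_patches, dim)
        b = patch_embeddings.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1),
                            patch_embeddings], dim=1) + self.pos_embed
        z = self.encoder(tokens)  # both tokens interact with patches via attention
        return self.head(z[:, 0]), self.head_dist(z[:, 1])

def hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    # Class token learns from the true label; distillation token learns from
    # the teacher's hard predictions (hard-label distillation).
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(logits_cls, labels) + \
           0.5 * F.cross_entropy(logits_dist, teacher_labels)
```

At inference time the two heads can be fused (e.g., by averaging their softmax outputs), so the distillation token contributes to the prediction even when no teacher is present.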