30 Nov 2021 | Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E.H. Tay, Jiashi Feng, Shuicheng Yan
This paper introduces T2T-ViT, a Vision Transformer (ViT) variant that improves ImageNet performance without pretraining on large datasets such as JFT-300M. T2T-ViT addresses two main limitations of ViT: 1) the simple tokenization that hard-splits an image into patches fails to model local structure such as edges and lines, which hurts training efficiency; and 2) the redundant attention backbone design limits feature richness under a fixed compute and parameter budget. To overcome these issues, T2T-ViT uses a Tokens-to-Token (T2T) module that progressively structures the image into tokens by aggregating neighboring tokens into one, so that local structure is modeled and the token length is gradually reduced. It also adopts a deep-narrow backbone inspired by CNN design, which improves feature richness while reducing parameters and MACs.
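To make the token aggregation concrete, here is a minimal PyTorch sketch of one T2T step under assumed settings (the kernel size, stride, padding, and embedding dimension below are illustrative, not the paper's exact configuration): a transformer layer processes the current tokens, the tokens are reshaped back to a spatial map ("re-structurization"), and an overlapping unfold ("soft split") concatenates each neighborhood into one new, longer token, shortening the sequence.

```python
# Minimal sketch of one Tokens-to-Token step (illustrative hyperparameters).
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    """Re-structurization + soft split: tokens are attended to, reshaped onto an
    h x w grid, and overlapping neighborhoods are concatenated into new tokens,
    so local structure is mixed in and the token length shrinks."""

    def __init__(self, dim=64, kernel=3, stride=2, padding=1):
        super().__init__()
        self.kernel, self.stride, self.padding = kernel, stride, padding
        # A lightweight transformer layer applied to the incoming tokens.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=1, dim_feedforward=dim, batch_first=True
        )
        # Soft split: overlapping unfold turns each window into one token.
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim), laid out on an h x w grid.
        tokens = self.block(tokens)
        # Re-structurization: back to a spatial feature map (B, dim, h, w).
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        # Soft split: (B, dim*k*k, L) -> (B, L, dim*k*k) with L < h*w.
        new_tokens = self.unfold(x).transpose(1, 2)
        new_h = (h + 2 * self.padding - self.kernel) // self.stride + 1
        new_w = (w + 2 * self.padding - self.kernel) // self.stride + 1
        return new_tokens, new_h, new_w


tokens = torch.randn(2, 56 * 56, 64)        # tokens from a previous soft split
out, h, w = T2TStep(dim=64)(tokens, 56, 56)
print(out.shape, h, w)                      # (2, 784, 576), 28, 28 -- fewer, longer tokens
```

In the paper's design, a few such steps are stacked and the resulting tokens are projected to the backbone's embedding dimension before a class token and position embedding are added.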
Trained from scratch on ImageNet, T2T-ViT improves significantly over vanilla ViT, outperforms ResNets of comparable size, and matches lightweight MobileNets. For example, a T2T-ViT comparable in size to ResNet50 reaches 83.3% top-1 accuracy at 384x384 resolution. The T2T module and the deep-narrow backbone are the key innovations: the former captures local structure, the latter improves feature richness. The paper also surveys several CNN-inspired architecture designs for ViT and finds the deep-narrow structure the most effective. T2T-ViT is efficient and lightweight, and scales well to both small and large model sizes; the results show that T2T-ViT can outperform CNNs when trained from scratch, demonstrating the effectiveness of the proposed design.
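The deep-narrow choice can be illustrated with a rough parameter count: at a similar budget, more transformer layers with a smaller hidden dimension are preferred over fewer, wider layers. The depths and widths below are assumptions for the sketch, not the released T2T-ViT configurations.

```python
# Rough comparison of deep-narrow vs. shallow-wide backbones at a similar
# parameter budget (layer counts and widths are illustrative assumptions).
import torch.nn as nn


def backbone(depth, dim, heads, mlp_ratio=3):
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads,
        dim_feedforward=int(dim * mlp_ratio), batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


def millions(model):
    return sum(p.numel() for p in model.parameters()) / 1e6


deep_narrow = backbone(depth=14, dim=384, heads=6)   # more layers, smaller dim
shallow_wide = backbone(depth=6, dim=600, heads=6)   # fewer layers, larger dim

print(f"deep-narrow : {millions(deep_narrow):.1f}M params")
print(f"shallow-wide: {millions(shallow_wide):.1f}M params")
```

In the paper's comparison of CNN-inspired designs at roughly matched parameters and MACs, the deep-narrow structure gives the best accuracy, which motivates the configuration used for T2T-ViT.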