30 Nov 2021 | Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E.H. Tay, Jiashi Feng, Shuicheng Yan
This paper introduces T2T-ViT, a Vision Transformer (ViT) variant that improves ImageNet performance without pretraining on large datasets such as JFT-300M. T2T-ViT addresses two main limitations of ViT: 1) the simple tokenization that hard-splits an image into patches fails to model local structure such as edges and lines, which hurts training efficiency; and 2) the redundant attention backbone design limits feature richness under a fixed compute and parameter budget. To overcome these issues, T2T-ViT uses a Tokens-to-Token (T2T) module that progressively structures the image into tokens by aggregating neighboring tokens into one, so that local structure is modeled and the token length is gradually reduced. It also adopts a deep-narrow backbone inspired by CNN design, which improves feature richness while reducing parameters and MACs.
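To make the token aggregation concrete, here is a minimal PyTorch sketch of one T2T step under assumed settings (the kernel size, stride, padding, and embedding dimension below are illustrative, not the paper's exact configuration): a transformer layer processes the current tokens, the tokens are reshaped back to a spatial map ("re-structurization"), and an overlapping unfold ("soft split") concatenates each neighborhood into one new, longer token, shortening the sequence.

```python
# Minimal sketch of one Tokens-to-Token step (illustrative hyperparameters).
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    """Re-structurization + soft split: tokens are attended to, reshaped onto an
    h x w grid, and overlapping neighborhoods are concatenated into new tokens,
    so local structure is mixed in and the token length shrinks."""

    def __init__(self, dim=64, kernel=3, stride=2, padding=1):
        super().__init__()
        self.kernel, self.stride, self.padding = kernel, stride, padding
        # A lightweight transformer layer applied to the incoming tokens.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=1, dim_feedforward=dim, batch_first=True
        )
        # Soft split: overlapping unfold turns each window into one token.
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim), laid out on an h x w grid.
        tokens = self.block(tokens)
        # Re-structurization: back to a spatial feature map (B, dim, h, w).
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        # Soft split: (B, dim*k*k, L) -> (B, L, dim*k*k) with L < h*w.
        new_tokens = self.unfold(x).transpose(1, 2)
        new_h = (h + 2 * self.padding - self.kernel) // self.stride + 1
        new_w = (w + 2 * self.padding - self.kernel) // self.stride + 1
        return new_tokens, new_h, new_w


tokens = torch.randn(2, 56 * 56, 64)        # tokens from a previous soft split
out, h, w = T2TStep(dim=64)(tokens, 56, 56)
print(out.shape, h, w)                      # (2, 784, 576), 28, 28 -- fewer, longer tokens
```

In the paper's design, a few such steps are stacked and the resulting tokens are projected to the backbone's embedding dimension before a class token and position embedding are added.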
Trained from scratch on ImageNet, T2T-ViT improves significantly over vanilla ViT, outperforms ResNets of comparable size, and matches lightweight MobileNets. For example, a T2T-ViT comparable in size to ResNet50 reaches 83.3% top-1 accuracy at 384x384 resolution. The T2T module and the deep-narrow backbone are the key innovations: the former captures local structure, the latter improves feature richness. The paper also surveys several CNN-inspired architecture designs for ViT and finds the deep-narrow structure the most effective. T2T-ViT is efficient and lightweight, and scales well to both small and large model sizes; the results show that T2T-ViT can outperform CNNs when trained from scratch, demonstrating the effectiveness of the proposed design.
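The deep-narrow choice can be illustrated with a rough parameter count: at a similar budget, more transformer layers with a smaller hidden dimension are preferred over fewer, wider layers. The depths and widths below are assumptions for the sketch, not the released T2T-ViT configurations.

```python
# Rough comparison of deep-narrow vs. shallow-wide backbones at a similar
# parameter budget (layer counts and widths are illustrative assumptions).
import torch.nn as nn


def backbone(depth, dim, heads, mlp_ratio=3):
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads,
        dim_feedforward=int(dim * mlp_ratio), batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


def millions(model):
    return sum(p.numel() for p in model.parameters()) / 1e6


deep_narrow = backbone(depth=14, dim=384, heads=6)   # more layers, smaller dim
shallow_wide = backbone(depth=6, dim=600, heads=6)   # fewer layers, larger dim

print(f"deep-narrow : {millions(deep_narrow):.1f}M params")
print(f"shallow-wide: {millions(shallow_wide):.1f}M params")
```

In the paper's comparison of CNN-inspired designs at roughly matched parameters and MACs, the deep-narrow structure gives the best accuracy, which motivates the configuration used for T2T-ViT.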