26 Oct 2021 | Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang
This paper proposes a novel Transformer-in-Transformer (TNT) architecture for visual recognition. The TNT architecture divides input images into local patches (visual sentences) and further splits each patch into smaller sub-patches (visual words). It introduces an inner transformer block to model relationships between visual words and an outer transformer block to process sentence embeddings. The inner transformer block enhances local feature extraction, while the outer transformer block captures intrinsic information from the sequence of sentences. By stacking multiple TNT blocks, the model effectively captures both global and local information in images. The TNT architecture achieves higher accuracy and a better trade-off between accuracy and complexity than state-of-the-art visual transformers. Experiments on ImageNet and downstream tasks show that TNT outperforms other models, achieving 81.5% top-1 accuracy on ImageNet, 1.7% higher than the state-of-the-art visual transformer with similar computational cost. The TNT architecture is also efficient in terms of inference speed and can be applied to various computer vision tasks such as object detection and semantic segmentation. The model is implemented in PyTorch and MindSpore, and the code is available at https://github.com/huawei-noah/CV-Backbones and https://gitee.com/mindspore/models/tree/master/research/cv/TNT.
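The following is a minimal PyTorch sketch of a single TNT block as described above: an inner transformer operates on the visual words of each patch, and the resulting word embeddings are folded back into the corresponding sentence embedding before the outer transformer processes the sequence of sentences. The class `TNTBlock`, its dimensions, and the `word_to_sentence` projection are illustrative assumptions, not the authors' official implementation (which is available at the repositories linked above).

```python
# Hedged sketch of a TNT block; dimensions follow the paper's TNT-S setting
# (sentence dim 384, word dim 24, 4x4 = 16 words per sentence) but details
# such as normalization placement are simplified.
import torch
import torch.nn as nn


class TNTBlock(nn.Module):
    def __init__(self, word_dim=24, sentence_dim=384, words_per_sentence=16,
                 inner_heads=4, outer_heads=6, mlp_ratio=4):
        super().__init__()
        # Inner transformer: models relations among visual words within one patch.
        self.inner = nn.TransformerEncoderLayer(
            d_model=word_dim, nhead=inner_heads,
            dim_feedforward=word_dim * mlp_ratio, batch_first=True)
        # Projection that folds a patch's word embeddings into its sentence embedding.
        self.norm = nn.LayerNorm(word_dim * words_per_sentence)
        self.word_to_sentence = nn.Linear(word_dim * words_per_sentence, sentence_dim)
        # Outer transformer: models relations among visual sentences (patches).
        self.outer = nn.TransformerEncoderLayer(
            d_model=sentence_dim, nhead=outer_heads,
            dim_feedforward=sentence_dim * mlp_ratio, batch_first=True)

    def forward(self, words, sentences):
        # words:     (batch * num_patches, words_per_sentence, word_dim)
        # sentences: (batch, num_patches + 1, sentence_dim); index 0 is a class token.
        words = self.inner(words)
        b, n, _ = sentences.shape
        # Flatten each patch's words and add them to its sentence embedding
        # (the class token at position 0 is left unchanged).
        word_summary = self.word_to_sentence(self.norm(words.flatten(1)))
        sentences = torch.cat(
            [sentences[:, :1],
             sentences[:, 1:] + word_summary.view(b, n - 1, -1)], dim=1)
        sentences = self.outer(sentences)
        return words, sentences


# Usage: 14x14 = 196 patches plus a class token, 16 words per patch.
x_words = torch.randn(2 * 196, 16, 24)
x_sentences = torch.randn(2, 197, 384)
words_out, sentences_out = TNTBlock()(x_words, x_sentences)
print(words_out.shape, sentences_out.shape)  # (392, 16, 24) (2, 197, 384)
```

Stacking several such blocks yields the full TNT backbone: the word stream refines local detail while the sentence stream aggregates it into the global representation used for classification or downstream tasks.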