26 Oct 2021 | Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang
The paper introduces Transformer-iN-Transformer (TNT), a novel architecture for visual recognition. Conventional vision transformers such as ViT divide an image into patches and model the relations between those patches, which can overlook fine-grained detail inside each patch. To address this, TNT splits each patch into smaller sub-patches ("visual words") and adds an inner transformer block that models the relations among these words, enriching the representation with finer-grained features. An outer transformer block then processes the patch-level ("visual sentence") representations, which are updated by aggregating the features of the visual words. Experiments on ImageNet and downstream tasks show that TNT achieves higher accuracy than state-of-the-art vision transformers at a similar computational cost.
The paper also provides a complexity analysis, ablation studies, and visualizations of feature and attention maps to support the effectiveness of the proposed architecture.
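The inner/outer structure described above can be sketched in a toy numpy example. This is a minimal illustration, not the authors' implementation: the shapes, the identity-projection attention, and the `W_proj` word-to-sentence projection are all hypothetical simplifications chosen to show the data flow (words attend within each patch, their aggregated features are added into the patch embeddings, then patches attend to each other).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # toy single-head self-attention with identity Q/K/V projections
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# hypothetical sizes: 4 patches ("sentences"), each with 9 sub-patches ("words")
n_patch, n_word, d_word, d_patch = 4, 9, 8, 16
rng = np.random.default_rng(0)
words = rng.standard_normal((n_patch, n_word, d_word))       # word embeddings
patches = rng.standard_normal((1, n_patch, d_patch))         # sentence embeddings
W_proj = rng.standard_normal((n_word * d_word, d_patch)) * 0.01  # word -> sentence projection (assumed)

# inner transformer block: relations among words inside each patch
words = words + self_attention(words)

# aggregate word features into the corresponding patch embedding
patches = patches + (words.reshape(n_patch, -1) @ W_proj)[None]

# outer transformer block: relations among patch-level embeddings
patches = patches + self_attention(patches)

print(patches.shape)  # patch-level output: (1, 4, 16)
```

A real TNT block would add layer norm, multi-head attention with learned projections, and MLP sub-layers around both the inner and outer attention, but the residual word-to-sentence aggregation step shown here is the key structural difference from a plain ViT block.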