26 Oct 2021 | Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang
This paper proposes a novel Transformer-in-Transformer (TNT) architecture for visual recognition. The TNT architecture divides input images into local patches (visual sentences) and further splits each patch into smaller sub-patches (visual words). It introduces an inner transformer block to model relationships between visual words and an outer transformer block to process sentence embeddings. The inner transformer block enhances local feature extraction, while the outer transformer block captures intrinsic information from the sequence of sentences. By stacking multiple TNT blocks, the model effectively captures both global and local information in images. The TNT architecture achieves higher accuracy and a better trade-off between accuracy and complexity than state-of-the-art visual transformers. Experiments on ImageNet and downstream tasks show that TNT outperforms other models, achieving 81.5% top-1 accuracy on ImageNet, 1.7% higher than the state-of-the-art visual transformer with similar computational cost. The TNT architecture is also efficient in terms of inference speed and can be applied to various computer vision tasks such as object detection and semantic segmentation. The model is implemented in PyTorch and MindSpore, and the code is available at https://github.com/huawei-noah/CV-Backbones and https://gitee.com/mindspore/models/tree/master/research/cv/TNT.
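The following is a minimal PyTorch sketch of a single TNT block as described above: an inner transformer operates on the visual words of each patch, and the resulting word embeddings are folded back into the corresponding sentence embedding before the outer transformer processes the sequence of sentences. The class `TNTBlock`, its dimensions, and the `word_to_sentence` projection are illustrative assumptions, not the authors' official implementation (which is available at the repositories linked above).

```python
# Hedged sketch of a TNT block; dimensions follow the paper's TNT-S setting
# (sentence dim 384, word dim 24, 4x4 = 16 words per sentence) but details
# such as normalization placement are simplified.
import torch
import torch.nn as nn


class TNTBlock(nn.Module):
    def __init__(self, word_dim=24, sentence_dim=384, words_per_sentence=16,
                 inner_heads=4, outer_heads=6, mlp_ratio=4):
        super().__init__()
        # Inner transformer: models relations among visual words within one patch.
        self.inner = nn.TransformerEncoderLayer(
            d_model=word_dim, nhead=inner_heads,
            dim_feedforward=word_dim * mlp_ratio, batch_first=True)
        # Projection that folds a patch's word embeddings into its sentence embedding.
        self.norm = nn.LayerNorm(word_dim * words_per_sentence)
        self.word_to_sentence = nn.Linear(word_dim * words_per_sentence, sentence_dim)
        # Outer transformer: models relations among visual sentences (patches).
        self.outer = nn.TransformerEncoderLayer(
            d_model=sentence_dim, nhead=outer_heads,
            dim_feedforward=sentence_dim * mlp_ratio, batch_first=True)

    def forward(self, words, sentences):
        # words:     (batch * num_patches, words_per_sentence, word_dim)
        # sentences: (batch, num_patches + 1, sentence_dim); index 0 is a class token.
        words = self.inner(words)
        b, n, _ = sentences.shape
        # Flatten each patch's words and add them to its sentence embedding
        # (the class token at position 0 is left unchanged).
        word_summary = self.word_to_sentence(self.norm(words.flatten(1)))
        sentences = torch.cat(
            [sentences[:, :1],
             sentences[:, 1:] + word_summary.view(b, n - 1, -1)], dim=1)
        sentences = self.outer(sentences)
        return words, sentences


# Usage: 14x14 = 196 patches plus a class token, 16 words per patch.
x_words = torch.randn(2 * 196, 16, 24)
x_sentences = torch.randn(2, 197, 384)
words_out, sentences_out = TNTBlock()(x_words, x_sentences)
print(words_out.shape, sentences_out.shape)  # (392, 16, 24) (2, 197, 384)
```

Stacking several such blocks yields the full TNT backbone: the word stream refines local detail while the sentence stream aggregates it into the global representation used for classification or downstream tasks.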