Do Vision Transformers See Like Convolutional Neural Networks?


3 Mar 2022 | Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
Convolutional neural networks (CNNs) have long been the standard architecture for visual data, but recent work shows that Vision Transformers (ViTs) can match or exceed their performance on image classification. This raises a central question: do ViTs solve these tasks the same way CNNs do, or do they learn different visual representations? Analyzing the internal representations of ViTs and CNNs reveals striking differences: ViT representations are far more uniform across layers, self-attention lets global information be aggregated early, and ViT residual connections strongly propagate features from lower to higher layers. ViTs also preserve spatial information, to a degree that depends on the classification method. Finally, the scale of the pre-training dataset strongly affects the quality of intermediate features and transfer learning, and the paper draws connections to newer architectures such as MLP-Mixer.

The paper organizes its analysis around three themes: representation structure, use of spatial information, and the effects of scale. On representation structure, ViTs show much more uniform layer-to-layer similarity than CNNs: lower and higher layers are far more similar to each other than they are in a ResNet. The two architectures also align unevenly; a ResNet needs many more of its lower layers to compute representations comparable to those in the ViT's earliest layers.
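The layer-similarity results above come from comparing activations across layers with centered kernel alignment (CKA). Below is a minimal single-batch sketch of linear CKA in Python; the activation matrices (`layer_acts`) are random stand-ins for features extracted from a real ViT or ResNet, and the paper itself uses a minibatch CKA estimator rather than this simplified form.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two sets of activations.

    X: (n_examples, d1) activations from one layer.
    Y: (n_examples, d2) activations from another layer (d1 and d2 may differ).
    Returns a similarity score in [0, 1].
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X) ** 2
    norm_x = np.linalg.norm(X.T @ X)
    norm_y = np.linalg.norm(Y.T @ Y)
    return cross / (norm_x * norm_y)

# Example: compare every pair of layers to build a similarity heatmap.
# `layer_acts` is a placeholder list of (n_examples, d_l) activation matrices,
# one per layer, as would be extracted from a trained model on a fixed batch.
layer_acts = [np.random.randn(512, 64 * (l + 1)) for l in range(4)]
heatmap = np.array([[linear_cka(a, b) for b in layer_acts] for a in layer_acts])
print(np.round(heatmap, 2))
```

Plotting such a heatmap for a ViT versus a ResNet is what reveals the more uniform, block-free similarity structure of the ViT.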
ViTs and CNNs also differ in how they handle local and global information. ViT self-attention layers can aggregate both local and global information from the earliest layers, whereas CNN receptive fields are fixed by kernel size and grow only gradually with depth. This early access to global information leads to quantitatively different lower-layer features. Local information still matters, however: large-scale pre-training helps the early attention layers learn to attend locally. Consistent with this, effective receptive fields in ViTs are larger and more global than in ResNets, with a strong dependence on the central patch.

Skip connections are also more influential in ViTs than in ResNets: they strongly affect both performance and representation similarity, and they are crucial for the uniform representation structure. Partway through the network there is a phase transition in what the skip connections carry, shifting from mainly propagating the CLS token to mainly propagating the spatial tokens.

ViTs preserve spatial location information better than CNNs: higher-layer tokens remain strongly tied to their corresponding input locations, which matters for tasks beyond classification, such as object detection. The classification method plays a role here: ViTs trained with global average pooling (GAP) show less localization than ViTs trained with a CLS token. Linear probes confirm that ViT tokens remain spatially discriminative in higher layers, while the corresponding ResNet features are much less so.

Dataset scale is crucial for these properties and for transfer learning: linear probes on intermediate layers show that larger ViT models develop high-quality intermediate representations only when pre-trained on sufficiently large datasets. Together, these findings highlight concrete differences between how ViTs and CNNs process images, with implications for future research and applications.
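The linear probes referenced above, both for spatial localization and for evaluating intermediate representations at different pretraining scales, amount to fitting a linear classifier on frozen features. Here is a minimal sketch, assuming the feature matrices have already been extracted from a frozen backbone; the random arrays are placeholders for real features and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report test accuracy.

    *_feats: (n_examples, d) activations from one layer (or one spatial token),
             extracted with the backbone frozen.
    *_labels: (n_examples,) integer class labels.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# Placeholder features: in practice these would come from a pretrained ViT or
# ResNet, taken at several depths (and, for the localization probes, at
# individual spatial tokens).
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(1000, 768)), rng.normal(size=(200, 768))
train_y, test_y = rng.integers(0, 10, 1000), rng.integers(0, 10, 200)
print(linear_probe_accuracy(train_x, train_y, test_x, test_y))
```

Running this probe layer by layer is what exposes the scale effect: with small pretraining data, the higher layers of large ViTs give markedly weaker probe accuracy.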
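Returning to the local-versus-global comparison earlier in the summary: the amount of global information a self-attention head mixes in can be quantified by its mean attention distance. The sketch below is an assumed formulation (attention weights over spatial tokens, weighted by pixel distances between patch centres, with the CLS token already dropped); the paper's exact preprocessing and averaging over images may differ.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """Average spatial distance attended to by each head, in pixels.

    attn: (n_heads, n_patches, n_patches) attention weights for the spatial
          tokens of one image (rows sum to 1).
    grid_size: number of patches per side (e.g. 14 for a 224px image, 16px patches).
    patch_size: patch width/height in pixels.
    """
    # (row, col) coordinates of every patch centre, in pixels.
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1).reshape(-1, 2) * patch_size
    # Pairwise Euclidean distances between patch centres: (n_patches, n_patches).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance per query token under its attention weights,
    # then averaged over queries -> one number per head.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)

# Example with uniform attention (a stand-in for real attention maps): every
# head attends globally, so all heads report the same large mean distance.
n_heads, grid, patch = 12, 14, 16
uniform = np.full((n_heads, grid * grid, grid * grid), 1.0 / (grid * grid))
print(mean_attention_distance(uniform, grid, patch))
```

Heads with small mean distances behave like local convolutions, while heads with large distances aggregate global context; in lower ViT layers both kinds coexist.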