Do Vision Transformers See Like Convolutional Neural Networks?

3 Mar 2022 | Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
The paper "Do Vision Transformers See Like Convolutional Neural Networks?" by Maithra Raghu explores the differences in how Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) process visual data. ViTs, which use self-attention instead of convolution, have shown comparable or superior performance on image classification tasks, raising questions about their underlying mechanisms. The study finds that ViTs have more uniform representations across layers compared to CNNs, with greater similarity between lower and higher layers. This uniformity is attributed to the role of self-attention, which enables early aggregation of global information, and the strong propagation of features through residual connections. The paper also examines the impact of spatial localization, showing that ViTs preserve input spatial information, and investigates the effect of dataset scale on intermediate features and transfer learning. Finally, it discusses the connections to new architectures like the MLP-Mixer, highlighting the fundamental differences and similarities between ViTs and CNNs.The paper "Do Vision Transformers See Like Convolutional Neural Networks?" by Maithra Raghu explores the differences in how Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) process visual data. ViTs, which use self-attention instead of convolution, have shown comparable or superior performance on image classification tasks, raising questions about their underlying mechanisms. The study finds that ViTs have more uniform representations across layers compared to CNNs, with greater similarity between lower and higher layers. This uniformity is attributed to the role of self-attention, which enables early aggregation of global information, and the strong propagation of features through residual connections. The paper also examines the impact of spatial localization, showing that ViTs preserve input spatial information, and investigates the effect of dataset scale on intermediate features and transfer learning. Finally, it discusses the connections to new architectures like the MLP-Mixer, highlighting the fundamental differences and similarities between ViTs and CNNs.