11 Jun 2021 | Ilya Tolstikhin*, Neil Houlsby*, Alexander Kolesnikov*, Lucas Beyer*, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
The paper introduces MLP-Mixer, a novel architecture for computer vision that relies solely on multi-layer perceptrons (MLPs), without using convolutions or self-attention. The input image is split into patches and embedded into a "patches × channels" table, and MLP-Mixer alternates two types of layers over this table: channel-mixing MLPs and token-mixing MLPs. Channel-mixing MLPs operate on individual rows of the table (one row per patch, or "token"), allowing communication between different channels, while token-mixing MLPs operate on columns, enabling communication between different spatial locations. The architecture is deliberately simple and efficient, using only basic matrix multiplication, reshaping and transposition, and scalar nonlinearities.
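The two mixing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' JAX implementation: the function names, the GELU approximation, and the hidden dimensions are illustrative choices, though the structure (LayerNorm, transpose for token mixing, skip connections) follows the paper's Mixer layer.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # two-layer MLP applied along the last axis
    return gelu(x @ w1 + b1) @ w2 + b2

def layer_norm(x, eps=1e-6):
    # normalize each row over the channel dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mixer_block(x, token_params, channel_params):
    # x: (tokens, channels) table of patch embeddings
    # token-mixing MLP: transpose so the MLP mixes across tokens (columns)
    x = x + mlp(layer_norm(x).T, *token_params).T
    # channel-mixing MLP: mixes across channels within each token (rows)
    x = x + mlp(layer_norm(x), *channel_params)
    return x

# toy sizes: 16 tokens, 32 channels, hidden widths 8 (token) and 64 (channel)
rng = np.random.default_rng(0)
S, C, Ds, Dc = 16, 32, 8, 64
token_params = (0.02 * rng.standard_normal((S, Ds)), np.zeros(Ds),
                0.02 * rng.standard_normal((Ds, S)), np.zeros(S))
channel_params = (0.02 * rng.standard_normal((C, Dc)), np.zeros(Dc),
                  0.02 * rng.standard_normal((Dc, C)), np.zeros(C))
x = rng.standard_normal((S, C))
out = mixer_block(x, token_params, channel_params)
```

Note that the token-mixing MLP is shared across all channels and the channel-mixing MLP is shared across all tokens, so the parameter count of a block does not grow with the product of the two dimensions.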
The authors demonstrate that MLP-Mixer achieves competitive performance on image classification benchmarks when pre-trained on large datasets or trained with modern regularization techniques. Specifically, MLP-Mixer reaches 87.94% top-1 accuracy on ImageNet, comparable to state-of-the-art models like Vision Transformers (ViTs) and ResNets. It is also significantly faster in terms of throughput: the largest Mixer runs about 2.5 times faster than ViT-H/14 and almost twice as fast as BiT.
Experiments show that MLP-Mixer's performance improves with larger pre-training datasets, and it outperforms other models like ViTs and ResNets in terms of accuracy-compute trade-offs. The paper also explores the invariance of MLP-Mixer to input permutations, showing that it is more robust to changes in the order of patches and pixels compared to traditional CNNs.
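One intuition behind the permutation robustness is that the channel-mixing path applies the same MLP to every token independently, so it commutes with any reordering of the patches. The toy check below (a sketch, not the paper's experiment, with illustrative names and sizes) verifies this equivariance numerically:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def channel_mixing(x, w1, b1, w2, b2):
    # the same two-layer MLP is applied to every row (token) independently
    return gelu(x @ w1 + b1) @ w2 + b2

# toy sizes: 10 tokens, 6 channels, hidden width 12
rng = np.random.default_rng(1)
S, C, Dc = 10, 6, 12
w1, b1 = rng.standard_normal((C, Dc)), np.zeros(Dc)
w2, b2 = rng.standard_normal((Dc, C)), np.zeros(C)
x = rng.standard_normal((S, C))
perm = rng.permutation(S)

# permuting the tokens before the MLP, or permuting its output,
# gives the same result: the operation is token-permutation equivariant
a = channel_mixing(x[perm], w1, b1, w2, b2)
b = channel_mixing(x, w1, b1, w2, b2)[perm]
```

The token-mixing MLPs, by contrast, do depend on token order through their weights; the paper's experiment shows that training with a fixed shuffled order costs Mixer little accuracy, whereas it hurts CNNs substantially.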
Overall, the authors aim to spark further research beyond established CNNs and Transformers, exploring the potential of MLP-based architectures in computer vision.