8 Jan 2021 | Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
This paper introduces SwAV, an online algorithm for unsupervised learning of visual features by contrasting cluster assignments. Rather than comparing features pairwise, as contrastive methods do, SwAV enforces consistency between the cluster assignments of different views of the same image through a "swapped" prediction mechanism: the code (cluster assignment) of one view is predicted from the representation of another. Because assignments are computed online, SwAV requires neither a large memory bank nor a momentum network, is memory efficient, and scales to large datasets.

The paper also introduces multi-crop, a data augmentation strategy that mixes views of different resolutions without increasing memory or compute requirements. SwAV reaches 75.3% top-1 accuracy on ImageNet with a ResNet-50 and surpasses supervised pretraining on multiple transfer tasks. The method is effective with both small and large batch sizes, and multi-crop improves various other self-supervised methods by 2-4% top-1 on ImageNet. Across several benchmarks, SwAV outperforms prior self-supervised methods in both accuracy and efficiency; it is particularly well suited to online settings and to unsupervised pretraining on large, uncurated datasets.
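The swapped prediction mechanism can be summarized in a few lines of PyTorch. The sketch below is a minimal illustration, not the authors' implementation: the Sinkhorn-Knopp code computation follows the paper's equal-partition idea, but the function names (`sinkhorn`, `swav_loss`), temperature, and iteration count are assumptions made here for clarity.

```python
# Minimal sketch of SwAV's swapped prediction loss (illustrative, not official).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Turn prototype scores (B, K) into soft codes Q, enforcing an
    equal-partition constraint across the batch (Sinkhorn-Knopp)."""
    Q = torch.exp(scores / eps).t()               # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # normalize over prototypes
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # normalize over samples
    return (Q * B).t()                            # (B, K), rows sum to 1

def swav_loss(z1, z2, prototypes, temp: float = 0.1):
    """Swapped prediction: the code of one view supervises the other view."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    C = F.normalize(prototypes, dim=1)            # (K, D) prototype vectors
    s1, s2 = z1 @ C.t(), z2 @ C.t()               # prototype scores
    q1, q2 = sinkhorn(s1), sinkhorn(s2)           # codes (targets), no gradient
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    # "Swapped": predict view 2's code from view 1, and vice versa.
    return -0.5 * ((q2 * p1).sum(dim=1) + (q1 * p2).sum(dim=1)).mean()

# Toy usage: batch of 8, 128-d features, 32 prototypes.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
prototypes = torch.randn(32, 128, requires_grad=True)
loss = swav_loss(z1, z2, prototypes)
loss.backward()
```

Because the codes are computed without gradients, the encoder and prototypes are trained only through the log-softmax predictions, which is what lets SwAV avoid explicit pairwise feature comparisons and the memory structures they usually require.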
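Multi-crop can likewise be sketched with standard torchvision transforms. The crop counts, sizes (two global 224px views plus several 96px views), and scale ranges below are illustrative assumptions; the key point is that the additional views are low resolution, so they add little memory or compute.

```python
# Minimal sketch of multi-crop augmentation (sizes and scales assumed).
from torchvision import transforms

def multi_crop(num_local: int = 6):
    global_t = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.14, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    local_t = transforms.Compose([
        transforms.RandomResizedCrop(96, scale=(0.05, 0.14)),  # small, cheap views
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    def apply(img):
        # 2 full-resolution global views + num_local low-resolution views.
        return [global_t(img) for _ in range(2)] + \
               [local_t(img) for _ in range(num_local)]
    return apply

# Usage (hypothetical): views = multi_crop()(pil_image)
```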