5 May 2020 | Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby
The paper introduces Big Transfer (BiT), a recipe for pre-training visual representations on large supervised datasets and fine-tuning them on a wide range of downstream tasks. BiT scales up pre-training along two axes, dataset size (ILSVRC-2012, ImageNet-21k, and JFT-300M for the BiT-S, BiT-M, and BiT-L variants) and ResNet capacity, and achieves strong transfer across data regimes ranging from 1 example per class to 1 million total examples. BiT matches or outperforms state-of-the-art methods on multiple benchmarks, including ILSVRC-2012, CIFAR-10/100, Oxford-IIIT Pet, Oxford Flowers-102, and the 19-task Visual Task Adaptation Benchmark (VTAB). Rather than tuning hyperparameters per downstream task, the authors fine-tune with a single heuristic (the BiT-HyperRule) and provide a detailed analysis of the components behind BiT's effectiveness, emphasizing the interplay of model capacity and dataset size. BiT remains strong with very few labeled examples per class and also transfers to harder settings such as few-shot learning and object detection. The paper further examines normalization layers under large-batch training, concluding that Group Normalization (GN) combined with Weight Standardization (WS) is an effective alternative to Batch Normalization (BN) at the batch sizes used for large-scale pre-training.
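To make the GN + WS point concrete, here is a minimal PyTorch sketch (not the authors' released code, which is in TensorFlow/JAX) of the two choices the paper pairs together: Weight Standardization applied to convolution kernels and Group Normalization applied to activations. The names `WSConv2d` and `gn_ws_block` are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WSConv2d(nn.Conv2d):
    """Conv2d whose filters are standardized (zero mean, unit variance
    per output channel) before every forward pass."""

    def forward(self, x):
        w = self.weight
        # Standardize each filter over its (in_channels, kH, kW) dimensions.
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


def gn_ws_block(in_ch, out_ch, groups=32):
    """Conv -> GroupNorm -> ReLU block. Neither WS nor GN uses batch
    statistics, so behavior is independent of the (possibly very large
    or very small per-device) batch size. out_ch must be divisible by
    `groups`."""
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )
```

Because no batch statistics are computed, there is also nothing to recalibrate when the pre-trained weights are fine-tuned on a downstream task with a different batch size, which is part of why the paper favors this combination over BN.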