Do Better ImageNet Models Transfer Better?


17 Jun 2019 | Simon Kornblith, Jonathon Shlens, and Quoc V. Le (Google Brain)
Abstract: Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate-layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating that the features learned on ImageNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.

Introduction: The last decade of computer vision research has pursued academic benchmarks as a measure of progress, and no benchmark has been as hotly pursued as ImageNet. Network architectures measured against this dataset have fueled much progress across a broad array of problems, including transfer to new datasets, object detection, image segmentation, and perceptual metrics of images. Two implicit assumptions lie behind this progress: first, that network architectures that perform better on ImageNet necessarily perform better on other vision tasks; second, that better ImageNet models learn features that transfer better to those tasks.

Results: We examined 16 modern networks ranging in ImageNet top-1 accuracy from 71.6% to 80.8%, encompassing widely used Inception architectures, ResNets, DenseNets, MobileNets, and NASNets. For a fair comparison, we retrained all models with scale parameters for batch normalization layers and without label smoothing, dropout, or auxiliary heads. We evaluated the models on 12 image classification datasets with training set sizes ranging from 2,040 to 75,750 images, covering superordinate-level object classification, fine-grained object classification, texture classification, and scene classification. We found that ImageNet accuracy is a strong predictor of transfer accuracy both for logistic regression on penultimate-layer features and for fine-tuning. However, regularizers that improve ImageNet performance are highly detrimental to transfer learning based on penultimate-layer features. Architectures transfer well across tasks even when weights do not, and on two small fine-grained datasets, pretraining on ImageNet provides minimal benefit.
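As a concrete illustration of the fixed-feature-extractor setting, the sketch below extracts penultimate-layer features from a frozen pretrained network and fits a logistic regression classifier on a transfer dataset. This is a minimal sketch rather than the paper's exact pipeline: it assumes a stock torchvision ResNet-50 in place of the 16 retrained checkpoints, uses Oxford Flowers-102 (one of the 12 transfer datasets) as the target task, and fixes the regularization strength, whereas the paper tunes the L2 penalty per dataset.

```python
import numpy as np
import torch
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader

# Standard ImageNet preprocessing.
transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Oxford Flowers-102 is one of the 12 transfer datasets in the study.
train_set = datasets.Flowers102("data", split="train", transform=transform, download=True)
test_set = datasets.Flowers102("data", split="test", transform=transform, download=True)

# A stock torchvision ResNet-50 stands in for the paper's retrained checkpoints.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()  # expose the penultimate (2048-d) features
model.eval()

@torch.no_grad()
def extract_features(dataset):
    """Run the frozen network over a dataset and collect penultimate-layer features."""
    feats, labels = [], []
    for x, y in DataLoader(dataset, batch_size=64):
        feats.append(model(x).numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train_set)
X_test, y_test = extract_features(test_set)

# Multinomial logistic regression on the frozen features. A fixed C is used
# here for brevity; selecting C on a held-out split would be closer to the
# paper's protocol.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)
print("transfer accuracy:", clf.score(X_test, y_test))
```

Because the network is frozen, a single forward pass over the dataset suffices, which is also why this setting is so sensitive to the quality of the penultimate-layer features.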
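The headline correlations (r = 0.99 for the fixed-feature setting and 0.96 for fine-tuning) relate each network's ImageNet top-1 accuracy to its transfer accuracy. The sketch below shows the form of that comparison with hypothetical placeholder numbers; the paper works with logit-transformed (log-odds) accuracies, which makes accuracy differences comparable across datasets with very different error rates.

```python
import numpy as np
from scipy import stats

def logit(p):
    """Log-odds transform: differences in logit(accuracy) are comparable
    across datasets with very different baseline error rates."""
    return np.log(p / (1.0 - p))

# Hypothetical placeholder numbers, NOT the paper's measurements: ImageNet
# top-1 accuracy for a handful of networks, and each network's mean transfer
# accuracy over the 12 datasets.
imagenet_acc = np.array([0.716, 0.739, 0.752, 0.770, 0.788, 0.808])
transfer_acc = np.array([0.801, 0.815, 0.823, 0.840, 0.851, 0.866])

r, p_value = stats.pearsonr(logit(imagenet_acc), logit(transfer_acc))
print(f"correlation of logit-transformed accuracies: r = {r:.2f}")
```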