Do ImageNet Classifiers Generalize to ImageNet?


12 Jun 2019 | Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, Vaishaal Shankar
Abstract:

We create new test sets for CIFAR-10 and ImageNet. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively reused test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%–15% on CIFAR-10 and 11%–14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.

Introduction:

The goal of machine learning is to produce models that generalize. We usually measure generalization by performance on a held-out test set. This paper replicates the dataset creation process for two benchmarks, CIFAR-10 and ImageNet. We find that a wide range of classification models fail to reach their original accuracy scores: the drops range from 3% to 15% on CIFAR-10 and 11% to 14% on ImageNet. On ImageNet, the accuracy loss amounts to approximately five years of progress in a highly active period of machine learning research.

Conventional wisdom suggests that such drops arise because the models have been adapted to the specific images in the original test sets. However, our experiments show that the relative order of models is almost exactly preserved on our new test sets. Moreover, there are no diminishing returns in accuracy: every percentage point of improvement on the original test set translates to a larger improvement on our new test sets (see the sketch below). These results provide evidence that exhaustive test set evaluations are an effective way to improve image classification models, and that adaptivity is an unlikely explanation for the accuracy drops.

Instead, we propose an alternative explanation based on the relative difficulty of the original and new test sets. We demonstrate that it is possible to recover the original ImageNet accuracies almost exactly if we include only the easiest images from our candidate pool. This suggests that the accuracy scores of even the best image classifiers are still highly sensitive to minutiae of the data cleaning process. This brittleness puts claims about human-level performance into context. It also shows that current classifiers still do not generalize reliably even in the benign environment of a carefully controlled reproducibility experiment.

Figure 1 shows the main result of our experiment. Before we describe our methodology in Section 3, the next section provides relevant background. To enable future research, we release both our new test sets and the corresponding code.
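To make the "larger gains" claim concrete, here is a minimal sketch of the slope argument. The accuracy pairs below are hypothetical placeholders, not the paper's measurements (the paper fits its line on a probit scale across a much larger model testbed); a fitted slope above 1 is precisely the statement that one percentage point gained on the original test set corresponds to more than one point gained on the new one.

```python
# Minimal sketch: fit a line to (original accuracy, new accuracy) pairs.
# All numbers below are hypothetical placeholders for illustration only.
import numpy as np

orig_acc = np.array([72.0, 76.0, 79.0, 82.0, 84.0])  # original test set (%)
new_acc = np.array([60.0, 65.5, 69.5, 73.5, 76.5])   # new test set (%)

# Least-squares fit: new_acc ~ slope * orig_acc + intercept.
slope, intercept = np.polyfit(orig_acc, new_acc, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# A slope above 1 means no diminishing returns: models that score higher
# on the original test set gain even more on the new test set.
assert slope > 1.0
```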
Potential Causes of Accuracy Drops:

We adopt the standard classification setup and posit the existence of a "true" underlying data distribution D over labeled examples (x, y).
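For reference, this setup can be written out as follows. This is standard textbook notation consistent with the distribution posited above, not a verbatim reproduction of the paper's equations.

```latex
% Population loss of a model \hat{f} under the true distribution \mathcal{D}:
\[
  L_{\mathcal{D}}(\hat{f})
    = \Pr_{(x,y) \sim \mathcal{D}}\bigl[\hat{f}(x) \neq y\bigr]
    = \mathbb{E}_{(x,y) \sim \mathcal{D}}\bigl[\mathbb{1}\{\hat{f}(x) \neq y\}\bigr].
\]
% Empirical counterpart on a test set S = \{(x_1, y_1), \dots, (x_n, y_n)\}
% drawn i.i.d. from \mathcal{D}; test accuracy is 1 - L_S(\hat{f}):
\[
  L_{S}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\hat{f}(x_i) \neq y_i\}.
\]
```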