A Decade’s Battle on Dataset Bias: Are We There Yet?
Zhuang Liu and Kaiming He
Abstract: We revisit the "dataset classification" experiment proposed by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from. For example, we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias and model capabilities.
Introduction: In 2011, Torralba and Efros called for a battle against dataset bias in the community, right before the dawn of the deep learning revolution. Over the decade that followed, progress in building diverse, large-scale, comprehensive, and hopefully less biased datasets has been an engine powering the deep learning revolution. In parallel, advances in algorithms, particularly neural network architectures, have achieved unprecedented levels of ability at discovering concepts, abstractions, and patterns, including bias, from data.
In this work, we take a renewed "unbiased look at dataset bias" after the decade-long battle. Our study is driven by the tension between building less biased datasets and developing more capable models. While efforts to reduce bias in data may lead to progress, the development of more advanced models may better exploit dataset bias and thus counteract this progress.
Our study is based on a fabricated task we call dataset classification, which is the "Name That Dataset" experiment designed in [51]. Specifically, we randomly sample a large number of images from each of several datasets, and train a neural network on their union to classify from which dataset an image is taken. The datasets we experiment with are presumably among the most diverse, largest, and least curated datasets in the wild, collected from the Internet. For example, a typical combination we study, referred to as "YCD", consists of images from YFCC, CC, and DataComp and presents a 3-way dataset classification problem.
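For concreteness, the sketch below illustrates this setup in PyTorch: sampled images are placed in one folder per source dataset, and a standard image classifier is trained to predict the folder (i.e., dataset) label. The paths, backbone, and hyperparameters here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the dataset classification setup. The directory layout,
# backbone, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# ImageFolder maps subfolder names (e.g. yfcc/, cc/, datacomp/) to labels
# 0..K-1, so the "class" being predicted is literally the source dataset.
train_set = datasets.ImageFolder("ycd_train", transform=transform)  # hypothetical path
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet50(weights=None)  # trained from scratch, no pretraining
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # 3-way for YCD

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, dataset_ids in train_loader:
    loss = criterion(model(images), dataset_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```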
We observe that modern neural networks can achieve excellent accuracy on such a dataset classification task. Trained on the aforementioned YCD combination, which is challenging for human beings, a model can achieve >84% classification accuracy on the held-out validation data, vs. the chance-level accuracy of 33.3%. This observation is highly robust across a large variety of dataset combinations and different generations of architectures, with very high accuracy achieved in most cases.
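A corresponding sketch of the held-out evaluation, including the chance-level baseline for a 3-way problem, is shown below; it continues from the sketch above (the validation path and settings are again illustrative assumptions).

```python
# Sketch of held-out evaluation for the dataset classifier; "model" and
# "transform" are defined in the sketch above, and "ycd_val" is a
# hypothetical validation directory mirroring the training folders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets

val_set = datasets.ImageFolder("ycd_val", transform=transform)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)

model.eval()
correct = total = 0
with torch.no_grad():
    for images, dataset_ids in val_loader:
        preds = model(images).argmax(dim=1)       # predicted source dataset
        correct += (preds == dataset_ids).sum().item()
        total += dataset_ids.numel()

print(f"val acc: {correct / total:.1%}  "
      f"chance: {1.0 / len(val_set.classes):.1%}")  # 1/3 = 33.3% for YCD
```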
Intriguingly, for such a dataset classification task, we have a series of observations that are analogous to those observed in semantic classification tasks.