A Decade’s Battle on Dataset Bias: Are We There Yet?
Zhuang Liu and Kaiming He
Abstract: We revisit the "dataset classification" experiment proposed by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from. For example, we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias and model capabilities.
Introduction: In 2011, Torralba and Efros called for a battle against dataset bias in the community, right before the dawn of the deep learning revolution. Over the decade that followed, progress in building diverse, large-scale, comprehensive, and hopefully less biased datasets has been an engine powering the deep learning revolution. In parallel, advances in algorithms, particularly neural network architectures, have achieved unprecedented levels of ability at discovering concepts, abstractions, and patterns, including bias, from data.
In this work, we take a renewed "unbiased look at dataset bias" after the decade-long battle. Our study is driven by the tension between building less biased datasets and developing more capable models. While efforts to reduce bias in data may lead to progress, the development of more advanced models may better exploit dataset bias and thus counteract this progress.
Our study is based on a fabricated task we call dataset classification, which is the "Name That Dataset" experiment designed in [51]. Specifically, we randomly sample a large number of images from each of several datasets, and train a neural network on their union to classify from which dataset an image is taken. The datasets we experiment with are presumably among the most diverse, largest, and least curated datasets in the wild, collected from the Internet. For example, a typical combination we study, referred to as "YCD", consists of images from YFCC, CC, and DataComp and presents a 3-way dataset classification problem.
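For concreteness, the sketch below illustrates this setup in PyTorch: sampled images are placed in one folder per source dataset, and a standard image classifier is trained to predict the folder (i.e., dataset) label. The paths, backbone, and hyperparameters here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the dataset classification setup. The directory layout,
# backbone, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# ImageFolder maps subfolder names (e.g. yfcc/, cc/, datacomp/) to labels
# 0..K-1, so the "class" being predicted is literally the source dataset.
train_set = datasets.ImageFolder("ycd_train", transform=transform)  # hypothetical path
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet50(weights=None)  # trained from scratch, no pretraining
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # 3-way for YCD

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, dataset_ids in train_loader:
    loss = criterion(model(images), dataset_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```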
We observe that modern neural networks can achieve excellent accuracy on such a dataset classification task. Trained on the aforementioned YCD combination, which is challenging for human beings, a model can achieve >84% classification accuracy on the held-out validation data, vs. the chance-level accuracy of 33.3%. This observation is highly robust across a large variety of dataset combinations and different generations of architectures, with very high accuracy achieved in most cases.
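A corresponding sketch of the held-out evaluation, including the chance-level baseline for a 3-way problem, is shown below; it continues from the sketch above (the validation path and settings are again illustrative assumptions).

```python
# Sketch of held-out evaluation for the dataset classifier; "model" and
# "transform" are defined in the sketch above, and "ycd_val" is a
# hypothetical validation directory mirroring the training folders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets

val_set = datasets.ImageFolder("ycd_val", transform=transform)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)

model.eval()
correct = total = 0
with torch.no_grad():
    for images, dataset_ids in val_loader:
        preds = model(images).argmax(dim=1)       # predicted source dataset
        correct += (preds == dataset_ids).sum().item()
        total += dataset_ids.numel()

print(f"val acc: {correct / total:.1%}  "
      f"chance: {1.0 / len(val_set.classes):.1%}")  # 1/3 = 33.3% for YCD
```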
Intriguingly, for such a dataset classification task, we have a series of observations that are analogous to those observed in semantic classification tasks.