16 Jul 2021 | Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang
The paper introduces WILDS, a curated benchmark of 10 datasets designed to reflect real-world distribution shifts that can significantly degrade the accuracy of machine learning (ML) systems. These shifts include domain generalization, where the training and test distributions come from different but related domains, and subpopulation shift, where the test distribution is a subpopulation of the training distribution. The datasets span a wide range of applications: animal species classification, tumor identification, genetic perturbation classification, molecular property prediction, wheat head detection, text toxicity classification, land use classification, poverty mapping, sentiment analysis, and code completion.

Each dataset exhibits a realistic distribution shift and a corresponding performance drop, making it suitable for evaluating models' robustness to real-world challenges. The paper also provides an open-source Python package that automates data loading and evaluation and includes default models and hyperparameters. The goal is to facilitate the development of ML methods that handle distribution shifts and can be deployed reliably in a variety of real-world settings.
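The distinction between the two shift types can be sketched as a simple check on the sets of domains seen at train versus test time (a hypothetical illustration in plain Python; `shift_type` is not part of the WILDS package):

```python
# Illustrative only: classify a train/test split by how its domain
# sets relate, mirroring the paper's two shift categories.

def shift_type(train_domains, test_domains):
    """Return the kind of distribution shift implied by the domain sets."""
    train, test = set(train_domains), set(test_domains)
    if test.isdisjoint(train):
        # Test domains never appear in training, e.g. photos from
        # new camera traps or patients from a new hospital.
        return "domain generalization"
    if test <= train:
        # Test domains were seen in training, but their relative
        # frequencies may differ, e.g. rare demographic subgroups.
        return "subpopulation shift"
    return "mixed"

# Example: training on hospitals 1-2 and testing on hospital 3
# is a domain generalization problem.
print(shift_type(["hospital_1", "hospital_2"], ["hospital_3"]))
```

In practice a benchmark dataset fixes this split in advance; the check above only makes the definitions from the summary concrete.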