16 Jul 2021 | Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang
WILDS is a benchmark of 10 datasets that reflect diverse real-world distribution shifts, such as shifts across hospitals for tumor identification, across camera traps for wildlife monitoring, and across time and location in satellite imaging and poverty mapping. These shifts are under-represented in widely used ML datasets, which are often designed for the standard i.i.d. setting. WILDS provides a curated benchmark with evaluation metrics and train/test splits that capture a broad range of distribution shifts ML models face in the wild. The datasets include tasks such as animal species classification, tumor identification, bioassay prediction, genetic perturbation classification, wheat head detection, text toxicity classification, land use classification, poverty mapping, sentiment analysis, and code completion. Each dataset reflects natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases. WILDS includes an open-source Python package that automates data loading and evaluation, along with default models and a public leaderboard for tracking state-of-the-art methods. The benchmark aims to facilitate the development of ML methods and models that are robust to real-world distribution shifts, enabling reliable deployment in the wild.
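As a rough illustration of the data-loading side of the package, here is a minimal sketch using the `get_dataset` / `get_subset` / `get_train_loader` pattern from the package's documented usage; the choice of Camelyon17, the image size, and the batch size are illustrative, and exact arguments may differ across package versions.

```python
# Minimal sketch: loading a WILDS dataset and iterating over the training split.
# Dataset choice, transform, and batch size are illustrative assumptions.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download and load the Camelyon17 tumor-identification dataset
# (download=True fetches the data on first use).
dataset = get_dataset(dataset="camelyon17", download=True)

# The train split draws from the training hospitals; the OOD test split
# comes from a held-out hospital.
train_data = dataset.get_subset(
    "train",
    transform=transforms.Compose([transforms.Resize((96, 96)), transforms.ToTensor()]),
)

# Standard data loader over the training domains.
train_loader = get_train_loader("standard", train_data, batch_size=32)

for x, y, metadata in train_loader:
    # x: image batch, y: labels, metadata: domain annotations (e.g., hospital ID)
    pass
```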
WILDS complements existing benchmarks by focusing on datasets with realistic shifts across diverse data modalities and applications, and it includes guidelines for method developers to ensure fair and effective evaluation. The datasets are designed to capture shifts that substantially degrade model performance, with training and test splits constructed to reflect those shifts. The benchmark also discusses other application areas where distribution shifts arise, such as algorithmic fairness, medicine and healthcare, genomics, natural language and speech processing, education, and robotics. WILDS provides a standardized framework for evaluating models across a wide range of real-world distribution shifts, helping to advance research in robust ML methods.
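The evaluation workflow follows the same pattern: predictions on the out-of-distribution test split are scored with the dataset's official metrics via `dataset.eval()`, which reports results both overall and per group (e.g., per hospital). The sketch below assumes the interface documented for the wilds package; the small linear model is a placeholder standing in for a trained classifier, and exact arguments may vary across versions.

```python
# Minimal sketch: evaluating predictions on the OOD test split with the
# dataset's official metrics. The placeholder model is an assumption; in
# practice you would evaluate a trained classifier.
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_eval_loader

dataset = get_dataset(dataset="camelyon17", download=True)
transform = transforms.Compose([transforms.Resize((96, 96)), transforms.ToTensor()])

# The OOD test split is drawn from a hospital unseen during training.
test_data = dataset.get_subset("test", transform=transform)
test_loader = get_eval_loader("standard", test_data, batch_size=32)

# Placeholder classifier (untrained); stands in for a real model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 2))
model.eval()

all_y_pred, all_y_true, all_metadata = [], [], []
with torch.no_grad():
    for x, y, metadata in test_loader:
        all_y_pred.append(model(x).argmax(dim=-1))  # predicted labels
        all_y_true.append(y)
        all_metadata.append(metadata)

# Official metrics, reported in aggregate and per group (e.g., per hospital).
results, results_str = dataset.eval(
    torch.cat(all_y_pred), torch.cat(all_y_true), torch.cat(all_metadata)
)
print(results_str)
```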