DECEMBER 2021 | TIMNIT GEBRU, JAMIE MORGENSTERN, BRIANA VECCHIONE, JENNIFER WORTMAN VAUGHAN, HANNA WALLACH, HAL DAUMÉ III, AND KATE CRAWFORD
Datasheets for Datasets aim to address the lack of standardized documentation for machine learning datasets. These documents promote transparency and accountability and help mitigate biases in machine learning models. They are designed to assist both dataset creators and consumers by documenting a dataset's motivation, composition, collection process, recommended uses, and other relevant information. Datasheets can increase transparency, help avoid unintended biases, and improve the reproducibility of machine learning results.
The authors propose a set of questions and a workflow for dataset creators to use when developing datasheets. The questions are grouped into sections corresponding to key stages of the dataset lifecycle: motivation, composition, collection process, preprocessing/cleaning/labeling, uses, distribution, and maintenance. The questions are not prescriptive but are intended to encourage reflection on the dataset creation process. For this reason, the authors also note that creating a datasheet is not intended to be automated: the value lies in the careful deliberation it prompts about the dataset's creation, distribution, and maintenance.
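The sectioned structure described above can be illustrated with a small sketch. This is a hypothetical representation, not the authors' official template: the section names come from the paper, but the `Datasheet` class, its methods, and the example question are illustrative choices.

```python
from dataclasses import dataclass, field

# The seven lifecycle sections proposed in the paper.
SECTIONS = [
    "motivation",
    "composition",
    "collection process",
    "preprocessing/cleaning/labeling",
    "uses",
    "distribution",
    "maintenance",
]

@dataclass
class Datasheet:
    """Hypothetical container: free-text answers grouped by section."""
    dataset_name: str
    answers: dict = field(default_factory=dict)  # section -> {question: answer}

    def add(self, section: str, question: str, answer: str) -> None:
        # Enforce that answers land in one of the paper's sections.
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        self.answers.setdefault(section, {})[question] = answer

    def to_text(self) -> str:
        # Render sections in lifecycle order, skipping unanswered ones.
        lines = [f"Datasheet: {self.dataset_name}"]
        for section in SECTIONS:
            if section in self.answers:
                lines.append(f"\n{section.upper()}")
                for question, answer in self.answers[section].items():
                    lines.append(f"  Q: {question}")
                    lines.append(f"  A: {answer}")
        return "\n".join(lines)

# Example usage with an illustrative question in the paper's style.
ds = Datasheet("ExampleCorpus")
ds.add("motivation",
       "For what purpose was the dataset created?",
       "To benchmark document classification.")
print(ds.to_text())
```

Keeping the section names as a fixed, ordered list mirrors the paper's point that the questions follow the dataset lifecycle rather than being an unstructured checklist.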
Datasheets have been adopted by academic researchers and by companies such as Microsoft, Google, and IBM. They have also inspired documentation for machine learning models and AI services. However, there are challenges in implementing datasheets, including the need for dataset creators to adapt the questions and workflow to their organizational infrastructure and the potential for dynamic datasets to require updated datasheets. Additionally, creating datasheets imposes overhead on dataset creators, but the benefits of increased transparency and accountability in the machine learning community are considered to outweigh the costs.