DECEMBER 2021 | BY TIMNIT GEBRU, JAMIE MORGENSTERN, BRIANA VECCHIONE, JENNIFER WORTMAN VAUGHAN, HANNA WALLACH, HAL DAUMÉ III, AND KATE CRAWFORD
The chapter discusses the critical role of data in machine learning and the importance of documenting datasets to ensure transparency, accountability, and mitigate societal biases. It highlights the lack of standardized processes for documenting machine learning datasets and proposes the creation of "datasheets for datasets." These datasheets would provide detailed information about the motivation, composition, collection process, preprocessing, uses, distribution, and maintenance of a dataset. The authors outline the development process, including feedback from various stakeholders, and provide a set of questions to guide the creation of datasheets. They also discuss the impact and challenges of implementing datasheets, emphasizing the need for organizational adjustments and the limitations of the approach in addressing all potential issues. Despite these challenges, the authors believe that the benefits of datasheets outweigh the costs, contributing to better communication and transparency in the machine learning community.The chapter discusses the critical role of data in machine learning and the importance of documenting datasets to ensure transparency, accountability, and mitigate societal biases. It highlights the lack of standardized processes for documenting machine learning datasets and proposes the creation of "datasheets for datasets." These datasheets would provide detailed information about the motivation, composition, collection process, preprocessing, uses, distribution, and maintenance of a dataset. The authors outline the development process, including feedback from various stakeholders, and provide a set of questions to guide the creation of datasheets. They also discuss the impact and challenges of implementing datasheets, emphasizing the need for organizational adjustments and the limitations of the approach in addressing all potential issues. Despite these challenges, the authors believe that the benefits of datasheets outweigh the costs, contributing to better communication and transparency in the machine learning community.