2024 | Dora Zhao, Jerone T. A. Andrews, Orestis Papakyriakopoulos, Alice Xiang
This paper explores the inherently value-laden nature of machine learning (ML) datasets, which are often perceived as neutral but are infused with social, political, and ethical ideologies. The authors argue that terms like "diversity," "bias," and "quality" lack clear definitions and validation, leading to issues in dataset construction and evaluation. They apply measurement theory from the social sciences to analyze 135 image and text datasets, identifying key considerations and offering recommendations for conceptualizing, operationalizing, and evaluating diversity. The paper emphasizes the need for precise definitions, transparent documentation, and robust validation methods to ensure that datasets genuinely embody the qualities claimed of them. The authors provide a structured approach to handling value-laden properties in dataset construction, advocating for greater nuance and precision to enhance transparency, reliability, and reproducibility in ML research. The paper concludes with a case study on the Segment Anything dataset to illustrate the practical application of these recommendations.