Position: Measure Dataset Diversity, Don’t Just Claim It

2024 | Dora Zhao, Jerone T. A. Andrews, Orestis Papakyriakopoulos, Alice Xiang
This paper argues that machine learning (ML) datasets, often seen as neutral, are inherently infused with social values and ideologies. The authors emphasize the need for clear, precise definitions of terms like diversity, bias, and quality in dataset creation, as these terms are frequently used without clear boundaries.

By applying principles from measurement theory, the paper provides a structured approach to conceptualizing, operationalizing, and evaluating diversity in ML datasets. The authors analyze 135 image and text datasets, identifying inconsistencies in how diversity is defined and operationalized. They highlight the importance of transparency in defining diversity and of ensuring that data collection processes align with these definitions. The paper also discusses the challenges of evaluating diversity, including reliability and validity, and presents methodologies for assessing these aspects. A case study on the Segment Anything dataset (SA-1B) illustrates the practical application of these recommendations.

The authors advocate for a more nuanced and precise approach to handling value-laden properties in dataset construction, emphasizing the need for clear definitions, transparent processes, and rigorous evaluation to enhance the reliability, reproducibility, and fairness of ML research. The paper underscores the broader implications of these considerations for ML development and scientific practice, calling for a more systematic and thoughtful approach to dataset creation and evaluation.
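To make "operationalizing diversity" concrete, here is a minimal sketch of one common way a dataset's diversity along a single categorical attribute can be quantified: normalized Shannon entropy over the attribute's label distribution. This specific metric and the `shannon_diversity` function are illustrative assumptions, not a method prescribed by the paper; the paper's point is precisely that any such choice must be explicitly defined and justified.

```python
import math
from collections import Counter

def shannon_diversity(labels):
    """Normalized Shannon entropy of a categorical attribute's distribution.

    Returns 0.0 when a single category dominates entirely and 1.0 when
    all observed categories are equally represented. This captures only
    one narrow, explicitly chosen facet of "diversity".
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Normalize by the maximum possible entropy for this many categories.
    return entropy / math.log(len(counts))

# Hypothetical scene-type annotations for two image datasets.
balanced = ["urban", "rural", "coastal"] * 10
skewed = ["urban"] * 28 + ["rural", "coastal"]

print(round(shannon_diversity(balanced), 3))  # 1.0 (uniform distribution)
print(round(shannon_diversity(skewed), 3))    # much lower: heavily skewed
```

Note that the score depends entirely on which attribute is annotated and how its categories are delimited, which is why the authors insist that such operationalizations be stated transparently rather than left implicit in a claim that a dataset is "diverse".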