Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention

June 3-6, 2024 | CEDRIC DESLANDES WHITNEY, JUSTIN NORMAN
This paper examines two key risks of using synthetic data in machine learning: diversity-washing and consent circumvention. Synthetic data is often used to address the challenges of collecting real-world data, particularly in domains like facial recognition, where real data is difficult to obtain for logistical and ethical reasons. Synthetic data can, however, introduce significant risks of its own. First, it can enable diversity-washing, in which datasets appear more diverse without addressing the underlying biases. The paper demonstrates this through a real-world case in which synthetic data was used to evaluate facial recognition technology, producing datasets that did not accurately represent real-world diversity. Second, synthetic data can circumvent consent for data usage, allowing model creators to sidestep the ethical and legal implications of collecting data without proper consent. The paper illustrates this through the central role consent plays in the U.S. Federal Trade Commission's regulation of data collection and of the models affected by it. The authors argue that synthetic data complicates existing governance and ethical practices by decoupling data from the people it affects, potentially consolidating power away from those most exposed to algorithmic harm. The paper also discusses the broader implications of synthetic data use, including its potential to exacerbate problems of consent and participation, and calls for further research into the ethical and practical challenges synthetic data poses.
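To make the diversity-washing risk concrete, the toy simulation below (not from the paper; every number and name is invented for illustration) assumes a recognition model with a higher real-world error rate for one demographic group, and a synthetic generator that happens to produce "easy" examples for that group. A demographically balanced synthetic evaluation set then reports a much smaller error gap between groups than real data would, which is the mechanism by which balanced-looking data can mask bias:

    import random

    random.seed(0)

    # Hypothetical per-group error rates: the model is worse on real Group B faces.
    REAL_ERROR = {"A": 0.02, "B": 0.10}
    # Assumed generator bias: synthetic Group B faces are "easy" cases, so the
    # model's error on them looks close to its error on Group A.
    SYNTHETIC_ERROR = {"A": 0.02, "B": 0.03}

    def observed_gap(rates, n=10_000):
        """Simulate n predictions per group and return the error-rate gap."""
        err = {g: sum(random.random() < p for _ in range(n)) / n
               for g, p in rates.items()}
        return abs(err["A"] - err["B"])

    print(f"balanced synthetic eval set gap: {observed_gap(SYNTHETIC_ERROR):.3f}")
    print(f"real-world eval set gap:         {observed_gap(REAL_ERROR):.3f}")

Despite the synthetic evaluation set being perfectly balanced across groups, its small measured gap (roughly 0.01) says little about the roughly 0.08 gap on real data: balance in group labels does not guarantee representativeness within groups.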