Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention

ACM FAccT '24, June 3–6, 2024, Rio de Janeiro, Brazil | Cedric Deslandes Whitney, Justin Norman
The paper "Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention" by Cedric Deslandes Whitney and Justin Norman explores the ethical and practical risks associated with the use of synthetic data in machine learning, particularly in the context of facial recognition technology (FRT). The authors highlight two key risks: 1. **Diversity-Washing**: Synthetic data can be used to create datasets that appear more diverse and representative, but this can be superficial and does not address underlying biases. The paper uses a real-world example of using synthetic data for FRT evaluation to illustrate how synthetic data can fail to mitigate bias in data distribution and representation. It also discusses how synthetic data can perpetuate harm by appearing legitimate when it is not. 2. **Circumvention of Consent**: Synthetic data can be used to circumvent consent requirements for data usage, complicating regulatory enforcement. The U.S. Federal Trade Commission (FTC) enforces regulations on data collection and model deployment, often based on the absence of proper consent. Synthetic data makes it easier for model creators to obfuscate the origins and consent of the data used, making it difficult for regulatory bodies to enforce compliance. The authors argue that these risks highlight the need for responsible use of synthetic data, emphasizing the importance of maintaining transparency, accountability, and ethical guidelines in the development and deployment of machine learning systems. They call for further research to address these challenges and to ensure that synthetic data is used in a way that respects the rights and interests of all stakeholders.The paper "Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention" by Cedric Deslandes Whitney and Justin Norman explores the ethical and practical risks associated with the use of synthetic data in machine learning, particularly in the context of facial recognition technology (FRT). The authors highlight two key risks: 1. **Diversity-Washing**: Synthetic data can be used to create datasets that appear more diverse and representative, but this can be superficial and does not address underlying biases. The paper uses a real-world example of using synthetic data for FRT evaluation to illustrate how synthetic data can fail to mitigate bias in data distribution and representation. It also discusses how synthetic data can perpetuate harm by appearing legitimate when it is not. 2. **Circumvention of Consent**: Synthetic data can be used to circumvent consent requirements for data usage, complicating regulatory enforcement. The U.S. Federal Trade Commission (FTC) enforces regulations on data collection and model deployment, often based on the absence of proper consent. Synthetic data makes it easier for model creators to obfuscate the origins and consent of the data used, making it difficult for regulatory bodies to enforce compliance. The authors argue that these risks highlight the need for responsible use of synthetic data, emphasizing the importance of maintaining transparency, accountability, and ethical guidelines in the development and deployment of machine learning systems. They call for further research to address these challenges and to ensure that synthetic data is used in a way that respects the rights and interests of all stakeholders.