22 Jan 2018 | Brian Hu Zhang, Blake Lemoine, Margaret Mitchell
This paper presents a framework for mitigating biases in machine learning models by jointly training a predictor and an adversary with respect to a protected variable. The goal is to maximize the predictor's ability to predict the target variable while minimizing the adversary's ability to predict the protected variable from the predictor's output. The approach is demonstrated on word-embedding analogy completion and on income classification with the UCI Adult dataset, where it achieves near-equality of odds with minimal loss of accuracy. The method is flexible and applicable to various definitions of fairness and to any gradient-based learning model, including regression and classification tasks.
The framework uses adversarial learning: the predictor is trained to minimize its prediction loss, while the adversary is trained to predict the protected variable from the predictor's output. The adversary's gradient with respect to the predictor's weights is then folded into the predictor's update, removing the component of the prediction-loss gradient that would help the adversary and subtracting a scaled copy of the adversary's gradient, so that the predictions carry less information about the protected variable. This technique applies to both discrete and continuous protected variables, including gender as reflected in word embeddings.
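As a concrete illustration, the update can be sketched in PyTorch. This is a minimal sketch under assumptions not spelled out in the summary above: a binary target y, a binary protected variable z, linear models for both predictor and adversary, and an adversary that sees only the predictor's output (the demographic-parity setting); the module names and the alpha hyperparameter are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

predictor = nn.Linear(10, 1)   # f(x): 10 features -> logit for the target y
adversary = nn.Linear(1, 1)    # a(y_logit): predictor output -> logit for z
opt_pred = torch.optim.SGD(predictor.parameters(), lr=0.01)
opt_adv = torch.optim.SGD(adversary.parameters(), lr=0.01)
bce = nn.BCEWithLogitsLoss()
alpha = 1.0                    # weight on the adversary's gradient

def train_step(x, y, z):
    # 1) Adversary step: learn to recover z from the predictor's output.
    y_logit = predictor(x)
    adv_loss = bce(adversary(y_logit.detach()), z)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # 2) Predictor step: take the prediction-loss gradient, remove its
    #    projection onto the adversary-loss gradient, and subtract a scaled
    #    copy of the adversary-loss gradient.
    y_logit = predictor(x)
    pred_loss = bce(y_logit, y)
    adv_loss = bce(adversary(y_logit), z)
    g_pred = torch.autograd.grad(pred_loss, predictor.parameters(), retain_graph=True)
    g_adv = torch.autograd.grad(adv_loss, predictor.parameters())
    opt_pred.zero_grad()
    for p, gp, ga in zip(predictor.parameters(), g_pred, g_adv):
        ga_unit = ga / (ga.norm() + 1e-8)
        proj = (gp * ga_unit).sum() * ga_unit   # projection of gp onto ga
        p.grad = gp - proj - alpha * ga
    opt_pred.step()
```

Here x, y, and z are float tensors with shapes (batch, 10), (batch, 1), and (batch, 1); for an equality-of-odds variant the adversary would also receive the true label y.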
Theoretical guarantees are provided, showing that under certain conditions the method can enforce fairness constraints such as demographic parity, equality of odds, and equality of opportunity. Experiments on the UCI Adult dataset demonstrate that the method achieves near-equality of odds with minimal accuracy loss. The method is also applied to debiasing word embeddings, where it reduces gender bias in analogy tasks while maintaining performance.
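For reference, these criteria can be measured directly on held-out predictions. The sketch below assumes binary NumPy arrays y_hat (predicted labels), y (true labels), and z (protected attribute); the function and key names are illustrative, and "gap" means the absolute difference in rates between the two protected groups.

```python
import numpy as np

def rate(y_hat, mask):
    # Mean prediction (positive rate) within the group selected by mask.
    return y_hat[mask].mean() if mask.any() else np.nan

def fairness_gaps(y_hat, y, z):
    # Demographic parity: positive-prediction rates should match across groups.
    dp_gap = abs(rate(y_hat, z == 1) - rate(y_hat, z == 0))
    # Equality of opportunity: true-positive rates should match across groups.
    tpr_gap = abs(rate(y_hat, (z == 1) & (y == 1)) - rate(y_hat, (z == 0) & (y == 1)))
    # Equality of odds additionally requires matching false-positive rates.
    fpr_gap = abs(rate(y_hat, (z == 1) & (y == 0)) - rate(y_hat, (z == 0) & (y == 0)))
    return {"demographic_parity_gap": dp_gap,
            "equal_opportunity_gap": tpr_gap,
            "equalized_odds_gap": max(tpr_gap, fpr_gap)}
```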
The paper also discusses challenges in training these models, including difficulties in convergence and the need for careful hyperparameter tuning. It proposes a simple adversary that can be used regardless of the complexity of the underlying model. The results show that the method effectively reduces bias while maintaining performance on the target task. Future work includes exploring the application of the method to more complex tasks and ensuring the stability of adversarial training.
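To make the simple adversary mentioned above concrete: for equality of odds it can be as small as a logistic regression over the predictor's output and the true label, so it can only exploit protected information that remains in the prediction once the label is known. This is a hedged sketch assuming PyTorch; the class name and layer shape are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class OddsAdversary(nn.Module):
    """Predicts the protected variable z from the predictor's output logit
    and the true label y, so fooling it pushes toward equality of odds."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)   # inputs: [y_logit, y] -> logit for z

    def forward(self, y_logit, y):
        # y_logit and y are (batch, 1) float tensors.
        return self.linear(torch.cat([y_logit, y], dim=1))
```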