Spurious Correlations in Machine Learning: A Survey

16 May 2024 | Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, Aidong Zhang
This paper provides a comprehensive review of spurious correlations in machine learning: correlations between non-essential features of inputs and their labels that do not imply causation. These correlations can harm model performance and generalization, especially in critical domains like healthcare. The paper defines spurious correlations formally and discusses their sources, including dataset biases, imbalanced group labels, and sampling noise. It highlights the role of inductive biases in machine learning models and how they can exacerbate the problem. The paper also reviews methods for addressing spurious correlations, categorized into data manipulation, representation learning, learning strategies, and other methods. Data manipulation techniques include data augmentation and generating auxiliary group information. Representation learning methods focus on improving model representations through causal intervention, feature disentanglement, invariant learning, and contrastive learning. Learning strategies involve optimization-based methods, ensemble learning, identification and mitigation, and fine-tuning strategies. Other methods address specific problem settings, such as multi-task problems and reinforcement learning. The paper discusses popular datasets and the metrics used to evaluate models' robustness to spurious correlations, including worst-group accuracy, average group accuracy, and bias-conflicting accuracy. It concludes with a discussion of future research challenges, such as mitigating spurious correlations without group labels, automated spurious correlation detection, balancing the trade-off between worst-group and average performance, and rigorous evaluation benchmarks.
The paper also explores the potential impact of foundation models on addressing spurious correlations, suggesting that they can be used to detect and mitigate spurious correlations more effectively.
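
The evaluation metrics named above can be illustrated with a small sketch. This is not code from the survey: the function names are hypothetical, and it assumes that a group is a (label, spurious attribute) pair and that a sample is bias-conflicting when its spurious attribute differs from its label. Benchmarks may define these groups differently.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, attrs):
    """Accuracy per group, where a group is assumed to be (label, spurious attribute)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, a in zip(y_true, y_pred, attrs):
        total[(y, a)] += 1
        correct[(y, a)] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(accs):
    """Lowest accuracy over all groups."""
    return min(accs.values())


def average_group_accuracy(accs):
    """Unweighted mean of the per-group accuracies."""
    return sum(accs.values()) / len(accs)


def bias_conflicting_accuracy(y_true, y_pred, attrs):
    """Accuracy on samples whose spurious attribute differs from the label (assumed definition)."""
    hits = [int(y == p) for y, p, a in zip(y_true, y_pred, attrs) if a != y]
    return sum(hits) / len(hits) if hits else float("nan")


# Toy data: the spurious attribute matches the label in 6 of 8 samples,
# and the hypothetical model simply predicts the attribute.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
attrs = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = list(attrs)

accs = group_accuracies(y_true, y_pred, attrs)
overall = sum(int(y == p) for y, p in zip(y_true, y_pred)) / len(y_true)

print("overall accuracy:", overall)
print("worst-group accuracy:", worst_group_accuracy(accs))
print("average group accuracy:", average_group_accuracy(accs))
print("bias-conflicting accuracy:", bias_conflicting_accuracy(y_true, y_pred, attrs))
```

On this toy data the overall accuracy is 0.75 while the worst-group accuracy is 0.0, which shows how a model that relies on a spurious attribute can look acceptable on average and still fail completely on minority groups.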