Spurious correlations in machine learning are statistical associations between input features and labels that are predictive in the training data but not causal, arising from coincidence or confounding factors. They can degrade model generalization and robustness, especially when data distributions shift in real-world deployment. This survey provides a comprehensive overview of spurious correlations, covering their causes, effects, and current mitigation methods, and summarizes existing datasets, benchmarks, and metrics to aid future research.
Spurious correlations often arise from dataset biases, imbalanced group labels, and sampling noise, and can lead models to rely on irrelevant shortcut features that fail on new data. Machine learning models are sensitive to these correlations because of their inductive biases and because optimization under Empirical Risk Minimization (ERM) may prioritize easily learnable spurious patterns over true causal relationships.
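The failure mode above can be reproduced on synthetic data. The sketch below (illustrative code, not taken from any surveyed method; all names are hypothetical) trains a plain ERM logistic regression on data with a weak causal feature and a strong spurious feature, then evaluates it after the spurious correlation reverses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_data(n, p_align):
    """Labels in {-1,+1}, one weak causal feature, one spurious feature
    that agrees with the label with probability p_align."""
    y = rng.choice([-1, 1], size=n)
    x_core = 0.5 * y + rng.normal(size=n)                         # causal but noisy
    flip = rng.choice([1, -1], size=n, p=[p_align, 1 - p_align])
    x_spur = y * flip                                             # spurious shortcut
    return np.stack([x_core, x_spur], axis=1), y

def train_erm(X, y, lr=0.1, steps=2000):
    """Plain ERM: minimize the average logistic loss by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))

X_tr, y_tr = make_data(5000, p_align=0.95)   # shortcut holds 95% of the time
X_te, y_te = make_data(5000, p_align=0.05)   # distribution shift: shortcut reversed
w = train_erm(X_tr, y_tr)
train_acc = accuracy(w, X_tr, y_tr)
shifted_acc = accuracy(w, X_te, y_te)
```

Because the spurious feature is almost noiseless on the training distribution, ERM assigns it the larger weight, and accuracy collapses once the shortcut reverses.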
The survey discusses various methods to address spurious correlations, grouped into data manipulation, representation learning, learning strategies, and other techniques. Data manipulation methods such as data augmentation and pseudo-label discovery aim to enhance data diversity and reduce spurious correlations. Representation learning techniques like causal intervention, invariant learning, and feature disentanglement help models learn more robust and generalizable features. Learning strategies, including optimization-based methods, ensemble learning, and adversarial training, aim to improve model robustness and fairness.
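Among the data-manipulation methods, group-balanced resampling is perhaps the simplest baseline to sketch: when group labels are available, minority groups (where the spurious correlation fails to hold) are upsampled so the shortcut no longer dominates the training distribution. A minimal NumPy sketch, not any specific paper's implementation:

```python
import numpy as np

def group_balanced_indices(groups, rng=None):
    """Resample so every (class, spurious-attribute) group appears equally
    often, weakening the spurious correlation in the training set."""
    rng = np.random.default_rng(rng)
    groups = np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    target = counts.max()
    idx = []
    for g in uniq:
        members = np.flatnonzero(groups == g)
        # upsample minority groups with replacement to the majority size
        idx.append(rng.choice(members, size=target, replace=True))
    return np.concatenate(idx)

groups = np.array([0] * 90 + [1] * 10)   # majority vs rare bias-conflicting group
idx = group_balanced_indices(groups, rng=0)
balanced = groups[idx]
```

Training on the resampled indices is equivalent to reweighting the loss by inverse group frequency, a common starting point before the more involved invariant-learning or adversarial approaches.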
The survey also highlights the importance of evaluating models using metrics like worst-group accuracy, average accuracy, and bias-conflicting accuracy to assess their robustness to spurious correlations. Future research challenges include developing group-label-free methods, automating spurious correlation detection, balancing worst-group and average performance, and creating rigorous evaluation benchmarks. Foundation models may play a significant role in addressing spurious correlations by leveraging their large-scale training data and ability to generalize across diverse domains. The survey concludes that spurious correlations remain a critical challenge in machine learning, requiring continued research to improve model robustness and fairness.
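Worst-group accuracy, the most widely reported of these metrics, is straightforward to compute once each example is assigned to a (class, spurious-attribute) group. A minimal sketch with hypothetical example data:

```python
import numpy as np

def group_accuracies(preds, labels, groups):
    """Accuracy within each (class, spurious-attribute) group."""
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    return {int(g): float(np.mean(preds[groups == g] == labels[groups == g]))
            for g in np.unique(groups)}

def worst_group_accuracy(preds, labels, groups):
    """Accuracy on the hardest group; average accuracy can hide this number."""
    return min(group_accuracies(preds, labels, groups).values())

preds  = [1, 1, 0, 0, 0, 0]
labels = [1, 0, 0, 0, 1, 0]
groups = [0, 0, 1, 1, 2, 2]   # e.g. indices of (class, attribute) combinations
per_group = group_accuracies(preds, labels, groups)
worst = worst_group_accuracy(preds, labels, groups)
```

Reporting worst-group alongside average accuracy exposes models that score well overall while failing on bias-conflicting groups.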