12 Aug 2019 | Andrew Ilyas*, Shibani Santurkar*, Dimitris Tsipras*, Logan Engstrom*, Brandon Tran, Aleksander Madry
Adversarial examples are not bugs; they are a consequence of the features machine learning models learn. This paper argues that adversarial examples arise from non-robust features: patterns in the data that are highly predictive of the correct label yet brittle and incomprehensible to humans. Such features are prevalent in standard datasets, and the authors disentangle them to expose their role in adversarial vulnerability. In particular, adversarial examples can be generated by slightly perturbing inputs along non-robust features, which are genuinely useful for standard classification but can be exploited by an adversary to cause misclassification.
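To make this concrete, here is a minimal sketch of a projected gradient descent (PGD) attack, the standard way such small perturbations are found. This is not the authors' code; the model, data, and epsilon budget are placeholders.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: find a perturbation delta with ||delta||_inf <= eps
    that maximizes the classification loss on (x, y)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Ascend the loss, then project back into the eps-ball
        # and the valid pixel range [0, 1].
        delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()
```

The resulting perturbation is imperceptible to a human, yet it systematically shifts the non-robust features the model relies on.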
The authors propose a theoretical framework that distinguishes robust from non-robust features: robust features remain predictive under adversarial perturbations, while non-robust features are easily flipped by them. They show that removing non-robust features from a dataset, leaving only robust ones, allows standard (non-adversarial) training to produce classifiers with nontrivial robustness. Conversely, datasets constructed to be consistent only with non-robust features still yield classifiers with good standard accuracy, indicating that non-robust features alone are sufficient for classification.
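Paraphrasing the paper's definitions for binary labels y ∈ {±1}, where a feature is a function f mapping inputs to the reals and Δ(x) is the set of allowed perturbations (e.g. a small ℓ∞ ball):

```latex
% f is \rho-useful: it correlates with the true label in expectation.
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \;\ge\; \rho

% f is \gamma-robustly useful: the correlation survives worst-case
% perturbations \delta drawn from the allowed set \Delta(x).
\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \,\Big] \;\ge\; \gamma

% A useful, non-robust feature is \rho-useful for some \rho > 0
% but not \gamma-robustly useful for any \gamma \ge 0.
```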
The paper also offers an explanation for adversarial transferability: adversarial examples transfer between independently trained models because those models learn similar non-robust features from the same data. In this view, adversarial vulnerability is a human-centric phenomenon, since from a model's perspective non-robust features can be just as predictive as robust ones. The authors further argue that approaches that aim to improve interpretability by enforcing human "priors" on explanations may simply hide meaningful, predictive features that models actually rely on.
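Measuring transferability reduces to crafting adversarial examples against one model and checking how often they fool another. The sketch below assumes a data loader and an attack callable such as the pgd_linf sketch above; both are placeholders rather than the paper's exact evaluation code.

```python
import torch

def transfer_rate(source_model, target_model, loader, attack):
    """Fraction of adversarial examples crafted against source_model
    that are also misclassified by target_model."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = attack(source_model, x, y)          # e.g. pgd_linf above
        with torch.no_grad():
            preds = target_model(x_adv).argmax(dim=1)
        fooled += (preds != y).sum().item()
        total += y.numel()
    return fooled / total
```

High transfer rates across independently trained architectures are what the paper attributes to shared non-robust features.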
The paper presents experiments that support these claims, including the construction of datasets with and without non-robust features and the analysis of adversarial examples in several settings. The results show that non-robust features are sufficient for standard classification and that adversarial examples can arise from perturbing them. The paper concludes that adversarial examples are a natural consequence of the presence of non-robust features in standard machine learning datasets, and that robustness and interpretability require explicitly encoding human priors into the training process.
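As an illustration of this experimental setup, the sketch below mirrors the idea behind the non-robust dataset construction: perturb each image toward a target class and relabel it as that class. The targeted_attack callable and the class count are placeholders, not the paper's exact procedure.

```python
import torch

def make_nonrobust_dataset(model, loader, targeted_attack, num_classes=10):
    """Build a training set whose labels are consistent only with
    non-robust features: each input is perturbed toward a random target
    class t and then relabeled as t. To a human the images appear
    mislabeled, yet classifiers trained on them generalize to the
    original, unmodified test set."""
    xs, ys = [], []
    for x, y in loader:
        t = torch.randint(num_classes, y.shape)   # random target labels
        x_adv = targeted_attack(model, x, t)      # e.g. targeted PGD toward class t
        xs.append(x_adv.detach().cpu())
        ys.append(t)
    return torch.cat(xs), torch.cat(ys)
```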