12 Aug 2019 | Andrew Ilyas*, Shibani Santurkar*, Dimitris Tsipras*, Logan Engstrom*, Brandon Tran, Aleksander Madry
Adversarial examples are not bugs; they are a consequence of the features machine learning models learn. This paper argues that adversarial examples arise from non-robust features: patterns in the data that are highly predictive of the correct label yet brittle and incomprehensible to humans. Such features are prevalent in standard datasets, and the authors disentangle them to expose their role in adversarial vulnerability. In particular, adversarial examples can be generated by slightly perturbing inputs along non-robust features, which are genuinely useful for standard classification but can be exploited by an adversary to cause misclassification.
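To make this concrete, here is a minimal sketch of a projected gradient descent (PGD) attack, the standard way such small perturbations are found. This is not the authors' code; the model, data, and epsilon budget are placeholders.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: find a perturbation delta with ||delta||_inf <= eps
    that maximizes the classification loss on (x, y)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Ascend the loss, then project back into the eps-ball
        # and the valid pixel range [0, 1].
        delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()
```

The resulting perturbation is imperceptible to a human, yet it systematically shifts the non-robust features the model relies on.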
The authors propose a theoretical framework that distinguishes robust from non-robust features: robust features remain predictive under adversarial perturbations, while non-robust features are easily flipped by them. They show that removing non-robust features from a dataset, leaving only robust ones, allows standard (non-adversarial) training to produce classifiers with nontrivial robustness. Conversely, datasets constructed to be consistent only with non-robust features still yield classifiers with good standard accuracy, indicating that non-robust features alone are sufficient for classification.
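Paraphrasing the paper's definitions for binary labels y ∈ {±1}, where a feature is a function f mapping inputs to the reals and Δ(x) is the set of allowed perturbations (e.g. a small ℓ∞ ball):

```latex
% f is \rho-useful: it correlates with the true label in expectation.
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \;\ge\; \rho

% f is \gamma-robustly useful: the correlation survives worst-case
% perturbations \delta drawn from the allowed set \Delta(x).
\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \,\Big] \;\ge\; \gamma

% A useful, non-robust feature is \rho-useful for some \rho > 0
% but not \gamma-robustly useful for any \gamma \ge 0.
```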
The paper also offers an explanation for adversarial transferability: adversarial examples transfer between independently trained models because those models learn similar non-robust features from the same data. In this view, adversarial vulnerability is a human-centric phenomenon, since from a model's perspective non-robust features can be just as predictive as robust ones. The authors further argue that approaches that aim to improve interpretability by enforcing human "priors" on explanations may simply hide meaningful, predictive features that models actually rely on.
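Measuring transferability reduces to crafting adversarial examples against one model and checking how often they fool another. The sketch below assumes a data loader and an attack callable such as the pgd_linf sketch above; both are placeholders rather than the paper's exact evaluation code.

```python
import torch

def transfer_rate(source_model, target_model, loader, attack):
    """Fraction of adversarial examples crafted against source_model
    that are also misclassified by target_model."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = attack(source_model, x, y)          # e.g. pgd_linf above
        with torch.no_grad():
            preds = target_model(x_adv).argmax(dim=1)
        fooled += (preds != y).sum().item()
        total += y.numel()
    return fooled / total
```

High transfer rates across independently trained architectures are what the paper attributes to shared non-robust features.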
The paper presents experiments that support these claims, including the construction of datasets with and without non-robust features and the analysis of adversarial examples in several settings. The results show that non-robust features are sufficient for standard classification and that adversarial examples can arise from perturbing them. The paper concludes that adversarial examples are a natural consequence of the presence of non-robust features in standard machine learning datasets, and that robustness and interpretability require explicitly encoding human priors into the training process.
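As an illustration of this experimental setup, the sketch below mirrors the idea behind the non-robust dataset construction: perturb each image toward a target class and relabel it as that class. The targeted_attack callable and the class count are placeholders, not the paper's exact procedure.

```python
import torch

def make_nonrobust_dataset(model, loader, targeted_attack, num_classes=10):
    """Build a training set whose labels are consistent only with
    non-robust features: each input is perturbed toward a random target
    class t and then relabeled as t. To a human the images appear
    mislabeled, yet classifiers trained on them generalize to the
    original, unmodified test set."""
    xs, ys = [], []
    for x, y in loader:
        t = torch.randint(num_classes, y.shape)   # random target labels
        x_adv = targeted_attack(model, x, t)      # e.g. targeted PGD toward class t
        xs.append(x_adv.detach().cpu())
        ys.append(t)
    return torch.cat(xs), torch.cat(ys)
```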