6 Nov 2018 | Amirata Ghorbani*, Abubakar Abid*, James Zou†
Neural network interpretation is fragile: small perturbations to the input can substantially change the interpretation without changing the predicted label. This fragility is a critical issue for trust and transparency in machine learning applications. The paper demonstrates that adversarial perturbations can produce inputs that are visually indistinguishable from the originals yet receive drastically different interpretations, and it systematically evaluates the robustness of three popular feature-importance methods (simple gradients, integrated gradients, and DeepLIFT) on the ImageNet and CIFAR-10 datasets. Across these methods, targeted perturbations change the interpretation dramatically while the label stays fixed. An analysis of the geometry of the Hessian of the network's output with respect to its input explains why this lack of robustness is a general challenge for current interpretation approaches.
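To give a feel for the kind of evaluation described above, here is a minimal sketch that computes a simple-gradient saliency map for an image and for a slightly perturbed copy, then compares the two maps with a top-k intersection and a Spearman rank correlation, similarity measures in the spirit of the paper's evaluation. The model choice, image, perturbation size, and helper names are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: compare saliency maps of an image and a slightly perturbed copy.
# Model (torchvision SqueezeNet), placeholder image, and perturbation size are
# assumptions made for illustration only.
import torch
import torchvision.models as models
from scipy.stats import spearmanr

def saliency_map(model, x, label):
    """Simple-gradient importance: |d score_label / d x|, summed over color channels."""
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, label].backward()
    return x.grad.abs().sum(dim=1).flatten()  # one importance score per pixel

def topk_intersection(s1, s2, k=1000):
    """Fraction of the k most important pixels shared by two importance maps."""
    top1 = set(torch.topk(s1, k).indices.tolist())
    top2 = set(torch.topk(s2, k).indices.tolist())
    return len(top1 & top2) / k

model = models.squeezenet1_1(weights="DEFAULT").eval()
x = torch.rand(1, 3, 224, 224)                    # placeholder image in [0, 1]
label = model(x).argmax(dim=1).item()

delta = 0.01 * torch.sign(torch.randn_like(x))    # small random-sign perturbation
# An interpretation attack only "counts" if the prediction is unchanged.
same_label = model(x + delta).argmax(dim=1).item() == label

s_orig = saliency_map(model, x, label)
s_pert = saliency_map(model, x + delta, label)

rho, _ = spearmanr(s_orig.numpy(), s_pert.numpy())
print("label unchanged:", same_label)
print("top-1000 intersection:", topk_intersection(s_orig, s_pert))
print("Spearman rank correlation:", rho)
```

Low intersection or low rank correlation between the two maps, with the label unchanged, is exactly the signature of fragility the paper measures.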
The paper also extends these findings to exemplar-based interpretations such as influence functions, which are similarly susceptible to adversarial attack. This fragility is a significant security concern, especially in medical and economic applications where users may read an interpretation as causal insight into a prediction. An adversary could perturb the input to draw attention away from relevant features or toward features of their choosing, and such attacks are difficult to detect because the predicted labels remain unchanged.
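For context, an influence-function interpretation (Koh and Liang, 2017) explains a test prediction by ranking training examples according to how much up-weighting each one would change the test loss; in the standard notation,

$$\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta),$$

where $z$ is a training example, $z_{\text{test}}$ the test input, and $\hat\theta$ the fitted parameters. The attack on this style of interpretation perturbs $z_{\text{test}}$ so that the ranking of training exemplars by $\mathcal{I}$ changes while the model's prediction on $z_{\text{test}}$ does not.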
The paper introduces methods for generating adversarial perturbations that alter interpretations while preserving the predicted label: random sign perturbations as a baseline, iterative attacks against feature-importance methods, and a gradient-sign attack against influence functions. Experiments show that these attacks substantially change the interpretation of a prediction even though the prediction itself is unchanged. A sketch of the iterative feature-importance attack follows.
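The sketch below illustrates the general shape of such an iterative attack: repeatedly step the input in the direction that reduces the saliency mass assigned to the originally important pixels, stay inside a small L-infinity ball, and keep a step only if the predicted label is unchanged. The objective, hyperparameters, and helper names are illustrative assumptions rather than the authors' exact attack (the paper describes several variants).

```python
# Hedged sketch of an iterative attack on a gradient-based feature-importance map.
# Objective, step sizes, and the use of |gradient| as the importance map are assumptions.
import torch

def interpretation_attack(model, x, label, topk_mask, eps=8/255, alpha=1/255, steps=50):
    """Push saliency mass out of the originally top-k pixels while keeping the label fixed.

    topk_mask: boolean (H, W) tensor marking the originally most-important pixels,
    e.g. built with torch.topk on the original saliency map.
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        score = model(x_adv)[0, label]
        # Saliency map, kept differentiable so we can take a gradient through it.
        saliency = torch.autograd.grad(score, x_adv, create_graph=True)[0].abs().sum(dim=1)
        # Attack objective: total importance still assigned to the original top-k pixels.
        loss = saliency[0][topk_mask].sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            candidate = x_adv - alpha * grad.sign()           # descend on the objective
            candidate = x + (candidate - x).clamp(-eps, eps)  # stay in the L_inf ball
            candidate = candidate.clamp(0, 1)
            # Accept the step only if the predicted label is unchanged
            # (in practice one might shrink alpha instead of discarding the step).
            if model(candidate).argmax(dim=1).item() == label:
                x_adv = candidate.detach()
            else:
                x_adv = x_adv.detach()
    return x_adv
```

One would then re-run a similarity measure such as the top-k intersection from the earlier sketch to confirm that the interpretation, and not the prediction, is what changed.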
The analysis of the Hessian highlights how high input dimensionality and the non-linearity of deep networks make interpretations fragile. The paper closes by discussing the implications for deploying interpretation methods in settings where adversaries may be present, and argues for interpretation methods that are robust by design, able to withstand such attacks and to provide reliable insight into a network's decision-making process.
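One way to state the first-order version of the Hessian argument: for a simple-gradient importance map, the change in the interpretation caused by a perturbation $\delta$ is governed by the Hessian of the class score $S$ with respect to the input,

$$\nabla_x S(x+\delta) \;\approx\; \nabla_x S(x) + H\,\delta, \qquad H = \nabla_x^{2} S(x).$$

A direction $\delta$ that is nearly orthogonal to $\nabla_x S(x)$ leaves the score, and hence the label, essentially unchanged, yet if it aligns with a large-eigenvalue eigenvector of $H$ it can reshape the importance map substantially. In high-dimensional, highly non-linear networks such directions are plentiful, which is why the fragility is generic rather than an artifact of any single interpretation method.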