24 Nov 2015 | Nicolas Papernot*, Patrick McDaniel*, Somesh Jha†, Matt Fredrikson†, Z. Berkay Celik*, Ananthram Swami§
Deep learning models are vulnerable to adversarial samples: inputs crafted to mislead neural networks. This paper introduces a novel method for generating adversarial samples by leveraging the forward derivative of deep neural networks (DNNs), i.e., the Jacobian of the function the network learns. The approach constructs adversarial saliency maps that identify the input features which, when perturbed, are most likely to cause the DNN to misclassify an input as a chosen target class. The method is validated on the LeNet architecture and the MNIST dataset, achieving a 97% adversarial success rate while modifying, on average, only 4.02% of the input features per sample. The algorithm generalizes across DNN architectures and is applicable to both supervised and unsupervised settings. The study also examines how humans perceive adversarial samples and introduces a hardness measure for evaluating how resistant source-target class pairs are to adversarial perturbation. The paper formalizes the threat model space for DNNs and provides a systematic approach to crafting adversarial samples, emphasizing the importance of understanding the mapping between the inputs and outputs of a DNN. The results highlight the vulnerability of DNNs to adversarial attacks and the need for robust defenses against such threats.
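To make the mechanics concrete, here is a minimal JAX sketch of the forward derivative and the resulting saliency map, for the variant that increases feature values. It is illustrative only: `model`, `params`, and the two-layer MLP are stand-ins (the paper evaluates LeNet on MNIST), and the Jacobian is taken over logits rather than softmax probabilities, since the latter sum to one and make the cross-class term degenerate.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    """Hypothetical stand-in for a trained DNN: returns class logits.
    The paper uses LeNet on MNIST; a two-layer MLP keeps the sketch short."""
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return h @ w2 + b2

def forward_derivative(params, x):
    """The forward derivative: the Jacobian J[c, i] = dF_c(x)/dx_i of the
    network outputs with respect to each input feature."""
    return jax.jacfwd(lambda x_: model(params, x_))(x)

def saliency_map(params, x, target):
    """Adversarial saliency map for increasing feature values:
    S[i] = 0 if dF_t/dx_i < 0 or sum_{c != t} dF_c/dx_i > 0,
    else (dF_t/dx_i) * |sum_{c != t} dF_c/dx_i|."""
    J = forward_derivative(params, x)   # shape (n_classes, n_features)
    dt = J[target]                      # derivatives of the target class
    do = J.sum(axis=0) - dt             # summed derivatives of the others
    return jnp.where((dt < 0) | (do > 0), 0.0, dt * jnp.abs(do))
```

Intuitively, a feature scores high when increasing it both raises the target class output and lowers the outputs of all other classes at once.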
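Crafting then proceeds greedily: perturb the most salient feature, recompute the map, and stop once the target class wins or a perturbation budget is exhausted. Note the simplifications: the paper's algorithm selects pairs of features per iteration and bounds total distortion by a dedicated parameter, whereas the single-feature loop below is a sketch of the same idea, with `theta` as the per-feature perturbation and hypothetical names throughout.

```python
def craft_adversarial(params, x, target, theta=1.0, max_iters=40):
    """Greedy crafting loop (simplified): repeatedly bump the most salient
    feature by theta until the model predicts the target class. max_iters
    is an arbitrary budget standing in for the paper's distortion bound."""
    x_adv = x
    done = jnp.zeros_like(x, dtype=bool)        # features already saturated
    for _ in range(max_iters):
        if int(jnp.argmax(model(params, x_adv))) == target:
            break                               # target class reached
        s = saliency_map(params, x_adv, target)
        s = jnp.where(done | (x_adv >= 1.0), -jnp.inf, s)
        i = int(jnp.argmax(s))                  # most salient feature
        x_adv = jnp.clip(x_adv.at[i].add(theta), 0.0, 1.0)
        done = done.at[i].set(True)
    return x_adv
```

For scale: an MNIST input has 28 x 28 = 784 pixels, so the reported average distortion of 4.02% of input features corresponds to roughly 32 modified pixels per adversarial sample.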