Generating Natural Language Adversarial Examples

24 Sep 2018 | Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang
This paper presents a method for generating adversarial examples in the natural language domain using a black-box, population-based optimization algorithm. The authors show that adversarial examples can be crafted that are semantically and syntactically similar to the original text, yet cause well-trained models to misclassify.

The attack is based on a genetic algorithm that iteratively evolves a population of candidate perturbations. The algorithm selects words to replace based on their semantic similarity and syntactic coherence, and uses a fitness function to evaluate how effectively each modification shifts the model's prediction; a code sketch of this procedure is given at the end of this summary. The attack is limited to a maximum of 20 iterations and to modifying at most 20% or 25% of the words in the text, depending on the task.

The method is evaluated on two tasks: sentiment analysis and textual entailment. For sentiment analysis, the attack flips the predicted sentiment from positive to negative and vice versa, achieving a 97% success rate. For textual entailment, it changes the prediction from 'entailment' to 'contradiction' and achieves a 70% success rate. Runtimes are reasonable in both cases.

A human study confirms that the adversarial examples are perceptibly similar to the original text: 92.3% of the successful examples were classified to their original label by 20 human annotators, and the examples received an average similarity rating of 2.23 on a scale from 1 to 4.

The authors also attempt to defend against the attack using adversarial training, but find that it does not improve the robustness of the models, suggesting that the generated adversarial examples are diverse and difficult to defend against. They conclude that their method is effective for generating adversarial examples in the natural language domain and encourage further research into improving the robustness of deep neural networks in this setting.
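To make the procedure concrete, the following is a minimal Python sketch of such a genetic, black-box word-substitution attack. It is an illustration under assumptions, not the authors' implementation: model_predict (a black-box model returning class probabilities for a list of tokens) and candidate_replacements (returning nearest-neighbor synonyms of a word, standing in for the embedding-based lookup) are hypothetical helpers, and the language-model filtering the authors use to keep replacements syntactically coherent is omitted here.

import random

def genetic_attack(tokens, target_class, model_predict, candidate_replacements,
                   pop_size=20, max_generations=20, max_change_ratio=0.2):
    """Evolve word substitutions until the model prefers `target_class`."""
    max_changes = max(1, int(max_change_ratio * len(tokens)))

    def mutate(candidate):
        # Replace one randomly chosen word with a semantically similar neighbor.
        new = list(candidate)
        i = random.randrange(len(new))
        neighbors = candidate_replacements(tokens[i])
        if neighbors:
            new[i] = random.choice(neighbors)
        return new

    def fitness(candidate):
        # Fitness = probability the black-box model assigns to the target class;
        # candidates exceeding the word-modification budget are scored zero.
        changed = sum(a != b for a, b in zip(candidate, tokens))
        if changed > max_changes:
            return 0.0
        return model_predict(candidate)[target_class]

    # Initial population: independently perturbed copies of the original text.
    population = [mutate(list(tokens)) for _ in range(pop_size)]

    for _ in range(max_generations):
        scores = [fitness(c) for c in population]
        best = max(range(pop_size), key=scores.__getitem__)
        probs = model_predict(population[best])
        if probs.index(max(probs)) == target_class:
            return population[best]          # success: prediction flipped

        # Selection: sample parents with probability proportional to fitness.
        total = sum(scores)
        weights = [s / total for s in scores] if total > 0 else None
        children = [population[best]]        # elitism: keep the best candidate
        while len(children) < pop_size:
            p1, p2 = random.choices(population, weights=weights, k=2)
            # Crossover: each position takes its word from one of the two parents,
            # then the child is mutated with a fresh synonym substitution.
            child = [random.choice(pair) for pair in zip(p1, p2)]
            children.append(mutate(child))
        population = children

    return None  # no adversarial example found within the generation budget

The pop_size, max_generations, and max_change_ratio defaults mirror the limits quoted above (20 iterations, at most 20%-25% of words modified); note that each generation costs pop_size black-box queries, so a practical implementation would cache model evaluations.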