HotFlip: White-Box Adversarial Examples for Text Classification


ACL 2018, Melbourne, Australia, July 15-20, 2018 | Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou
The paper "HotFlip: White-Box Adversarial Examples for Text Classification" by Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou introduces a method called HotFlip for generating adversarial examples in text classification tasks. The authors propose an efficient gradient-based optimization method to manipulate discrete text structures using one-hot representations. HotFlip focuses on character-level manipulations such as flips, insertions, and deletions, which can significantly increase the misclassification error of neural classifiers. The method uses directional derivatives to estimate the loss increase for each manipulation and employs beam search to find the most effective set of changes. The paper demonstrates that white-box adversaries are more effective than black-box adversaries in generating adversarial examples, and adversarial training using HotFlip improves the model's robustness against such attacks. Experiments on the AG's news dataset show that HotFlip can successfully trick the classifier with a high success rate, even at low confidence levels. The authors also evaluate the human perception of these adversarial examples, finding that they rarely alter the meaning of sentences. Additionally, the paper explores the applicability of HotFlip to word-level models and discusses the challenges and constraints in preserving semantic meaning during adversarial manipulations. The study concludes by highlighting the importance of understanding the vulnerabilities of NLP models to adversarial attacks and the need for further research in this area.The paper "HotFlip: White-Box Adversarial Examples for Text Classification" by Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou introduces a method called HotFlip for generating adversarial examples in text classification tasks. The authors propose an efficient gradient-based optimization method to manipulate discrete text structures using one-hot representations. HotFlip focuses on character-level manipulations such as flips, insertions, and deletions, which can significantly increase the misclassification error of neural classifiers. The method uses directional derivatives to estimate the loss increase for each manipulation and employs beam search to find the most effective set of changes. The paper demonstrates that white-box adversaries are more effective than black-box adversaries in generating adversarial examples, and adversarial training using HotFlip improves the model's robustness against such attacks. Experiments on the AG's news dataset show that HotFlip can successfully trick the classifier with a high success rate, even at low confidence levels. The authors also evaluate the human perception of these adversarial examples, finding that they rarely alter the meaning of sentences. Additionally, the paper explores the applicability of HotFlip to word-level models and discusses the challenges and constraints in preserving semantic meaning during adversarial manipulations. The study concludes by highlighting the importance of understanding the vulnerabilities of NLP models to adversarial attacks and the need for further research in this area.