Query-Based Adversarial Prompt Generation

19 Feb 2024 | Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
This paper introduces a query-based adversarial prompt generation method that leverages API access to a remote language model to construct adversarial examples, causing the model to emit harmful strings with far higher probability than transfer-only attacks. The authors demonstrate the attack against GPT-3.5 and OpenAI's safety classifier, achieving nearly 100% success in evading the classifier. The key contributions of the paper are:

1. **Query-based attack**: The attack constructs adversarial examples directly on a remote language model without relying on transferability, enabling targeted, surrogate-free attacks.
2. **Optimization**: An optimized variant of the Greedy Coordinate Gradient (GCG) attack removes the dependency on a surrogate model, turning it into a pure query-based attack.
3. **Evaluation**: The attack is evaluated on open-source models and on production models such as GPT-3.5 Turbo, showing high success rates at eliciting harmful strings and at evading content-moderation classifiers.

The paper also discusses practical considerations such as scoring prompts with logit bias and top-5 logprobs, short-circuiting the loss, and choosing better initial prompts (see the sketch below). The authors conclude by noting the limitations of current NLP adversarial example generation methods and the potential for future improvements.
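To make the query-based scoring primitive concrete, here is a minimal Python sketch of the general idea, not the authors' released attack. It assumes the OpenAI chat API's `logit_bias` and `top_logprobs` parameters; the model name, bias value, suffix length, and the simple random-mutation search loop (a simplified stand-in for the paper's optimized GCG-style candidate proposals) are illustrative assumptions.

```python
# Illustrative sketch only: score a candidate prompt by biasing the target
# token into the top-5 logprobs, then undo the bias to recover its true
# log-probability. Model name, BIAS, and the search loop are assumptions.
import math
import random
import tiktoken
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"
enc = tiktoken.encoding_for_model(MODEL)
BIAS = 30.0  # large logit bias to surface the target token in the top-5

def target_logprob(prompt: str, target: str) -> float:
    """Approximate log-probability that the next token starts `target`.

    Short-circuits the loss: only the first token of the target string is
    scored, which is enough to cheaply rank candidate prompts."""
    tok_id = enc.encode(target)[0]
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
        logit_bias={str(tok_id): BIAS},
    )
    for entry in resp.choices[0].logprobs.content[0].top_logprobs:
        if entry.token == enc.decode([tok_id]):
            # Undo the softmax shift: if q is the biased probability of the
            # target token, its unbiased probability is
            # q / (q + e^BIAS * (1 - q)).
            q = math.exp(entry.logprob)
            return math.log(q / (q + math.exp(BIAS) * (1.0 - q)))
    return float("-inf")  # still outside the top-5 even after biasing

def greedy_query_attack(base: str, target: str, steps: int = 500) -> str:
    """Gradient-free coordinate search: mutate one suffix token at a time,
    keeping a mutation whenever the target's logprob improves."""
    suffix = [random.randrange(enc.n_vocab) for _ in range(20)]
    best = target_logprob(base + enc.decode(suffix), target)
    for _ in range(steps):
        cand = list(suffix)
        cand[random.randrange(len(cand))] = random.randrange(enc.n_vocab)
        score = target_logprob(base + enc.decode(cand), target)
        if score > best:
            suffix, best = cand, score
    return base + enc.decode(suffix)
```

Each search step costs one API query per candidate, so the bias-inversion trick above is what keeps the attack's query budget practical: it recovers an exact logprob for an arbitrary token from only the top-5 logprobs the API exposes.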