19 Feb 2024 | Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
This paper introduces a query-based adversarial prompt generation method that leverages API access to a remote language model to construct adversarial examples, causing the model to emit harmful strings with much higher probability than transfer-only attacks. The authors demonstrate the effectiveness of their attack on GPT-3.5 and OpenAI's safety classifier, achieving nearly 100% success in evading the classifier. The key contributions of the paper include:
1. **Query-Based Attack**: The attack constructs adversarial examples directly on a remote language model without relying on transferability, enabling targeted, surrogate-free attacks.
2. **Optimization**: An optimized variant of the Greedy Coordinate Gradient (GCG) attack reduces the dependency on a surrogate model, making it a pure query-based attack (a minimal sketch follows this list).
3. **Evaluation**: The attack is evaluated on open-source models and on production models such as GPT-3.5 Turbo, showing high success rates at eliciting harmful strings and evading content-moderation classifiers.
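To make the search loop concrete, here is a minimal sketch of a purely query-based coordinate search in the spirit of GCG. The `score` callable stands in for the attack loss measured through the remote API (e.g. the negative log-likelihood of the harmful target string); it, `vocab`, and all hyperparameters are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def query_based_attack(score, vocab, suffix_len=20, iters=500, n_cands=32):
    """Greedy coordinate search over an adversarial suffix, driven purely
    by queries to a remote model.

    `score(suffix)` is assumed to return the attack loss (e.g. the negative
    log-probability of the harmful target string, measured via the API);
    lower is better. `vocab` is the list of candidate tokens.
    """
    suffix = [random.choice(vocab) for _ in range(suffix_len)]
    best_loss = score(suffix)
    for _ in range(iters):
        # Without gradients from a surrogate model, propose candidates by
        # swapping one random position for one random token each.
        candidates = []
        for _ in range(n_cands):
            cand = list(suffix)
            cand[random.randrange(suffix_len)] = random.choice(vocab)
            candidates.append(cand)
        # Query the remote model for every candidate and keep the best one.
        losses = [score(c) for c in candidates]
        best_idx = min(range(n_cands), key=losses.__getitem__)
        if losses[best_idx] < best_loss:
            suffix, best_loss = candidates[best_idx], losses[best_idx]
    return suffix, best_loss
```

Because every proposal costs real API queries, the practical considerations discussed below (cheaper scoring via logit bias, short-circuiting the loss, better initial prompts) are what make such a loop affordable.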
The paper also discusses practical considerations such as scoring prompts with logit-bias and top-5 logprobs, short-circuiting the loss, and choosing better initial prompts. The authors conclude by highlighting the limitations of current NLP adversarial example generation methods and the potential for future improvements.
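To make the logit-bias scoring idea concrete, here is a minimal sketch of the underlying softmax algebra: if a large logit bias is added to a target token so that it appears among the returned top-5 logprobs, its true (unbiased) log-probability can be recovered from the reported biased one. The function names are illustrative, and the assumption that the API reports the post-bias logprob of the boosted token is ours, not a quote of the paper's implementation.

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(e^a + e^b)."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def true_logprob(biased_logprob: float, bias: float) -> float:
    """Recover a token's unbiased log-probability.

    Assumes the API added `bias` to the target token's logit and then
    reported `biased_logprob` for it. If q' is the biased probability and
    q the true one, then q' = e^bias * q / (e^bias * q + 1 - q), which
    inverts to q = q' / (e^bias * (1 - q') + q'). Computed in log space.
    Requires biased_logprob < 0 (i.e. biased probability below 1).
    """
    log_one_minus = math.log1p(-math.exp(biased_logprob))
    return biased_logprob - logaddexp(bias + log_one_minus, biased_logprob)

# Sanity check: a token whose true probability is ~e^-10, lifted by a
# +10 bias to probability 0.5, should round-trip back to about -10.
assert abs(true_logprob(math.log(0.5), 10.0) - (-10.0)) < 0.01
```

Doing the correction in log space matters: for the bias magnitudes needed to surface rare tokens, computing `exp(bias)` directly would overflow a float.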