Universal and Transferable Adversarial Attacks on Aligned Language Models


20 Dec 2023 | Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
This paper presents a universal and transferable adversarial attack that induces aligned large language models (LLMs) to generate objectionable content. The attack appends a carefully crafted adversarial suffix to user queries, chosen to maximize the probability that the model begins with an affirmative response rather than refusing to answer. These suffixes are generated automatically with a combination of greedy and gradient-based discrete search techniques, improving on previous automatic prompt-generation methods.

The attack rests on three key elements: (1) targeting an initial affirmative response, (2) combined greedy and gradient-based discrete optimization over tokens, and (3) robust multi-prompt and multi-model optimization.

The resulting adversarial prompts are highly transferable, including to black-box, publicly released, production LLMs. The attack was evaluated on multiple models, including Vicuna-7B, Vicuna-13B, Pythia, Falcon, and Guanaco. On Vicuna it elicited 99 out of 100 harmful behaviors and produced exact matches with a target harmful string in 88 out of 100 cases. Transferred prompts achieved 84% success rates against GPT-3.5 and GPT-4, and 66% against PaLM-2; the success rate against GPT-based models was particularly high, potentially because Vicuna is trained on outputs from ChatGPT.

The results show that the attack reliably generates adversarial suffixes that circumvent the alignment of a target model, eliciting harmful behaviors from a wide range of open-source and proprietary LLMs. The paper also discusses the implications of these findings for LLM alignment: current alignment methods may be insufficient to prevent adversarial attacks, and further research is needed to develop more robust alignment strategies. Finally, it highlights the importance of responsible disclosure and ethical considerations in the development and use of adversarial attacks.
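To make the combined greedy and gradient-based search concrete, the sketch below shows roughly what a single optimization step looks like. It is a minimal, illustrative version of a Greedy Coordinate Gradient-style update, not the authors' released implementation: it assumes a HuggingFace-style PyTorch causal LM that accepts `inputs_embeds`, evaluates candidate swaps one at a time rather than in batches, and uses placeholder function and parameter names (`gcg_step`, `top_k`, `num_candidates`).

```python
import torch
import torch.nn.functional as F


def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             top_k=256, num_candidates=128):
    """One illustrative suffix update.

    prompt_ids, suffix_ids, target_ids: 1-D LongTensors of token ids,
    assumed to live on the same device as the model.
    embed_matrix: the model's token-embedding weight, shape (vocab, dim).
    Returns (new_suffix_ids, loss) after swapping a single suffix token.
    """
    embed_matrix = embed_matrix.detach()
    vocab_size = embed_matrix.shape[0]
    tgt_start = len(prompt_ids) + len(suffix_ids)

    # 1. Gradient of the target-sequence loss w.r.t. one-hot suffix tokens.
    one_hot = torch.zeros(len(suffix_ids), vocab_size,
                          dtype=embed_matrix.dtype, device=embed_matrix.device)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix                     # (L_suf, dim)
    full_embeds = torch.cat([embed_matrix[prompt_ids],
                             suffix_embeds,
                             embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits           # (1, L, vocab)
    # Cross-entropy only on the positions that predict the affirmative
    # target (e.g. "Sure, here is ...").
    loss = F.cross_entropy(logits[0, tgt_start - 1:-1], target_ids)
    loss.backward()

    # 2. Top-k most promising substitutions per suffix position
    #    (largest negative gradient).
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices    # (L_suf, k)

    # 3. Greedy part: sample random (position, token) swaps from the
    #    candidates and keep the one with the lowest true loss.
    best_loss, best_suffix = float("inf"), suffix_ids
    for _ in range(num_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        cand = suffix_ids.clone()
        cand[pos] = tok
        ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
        with torch.no_grad():
            cand_logits = model(ids).logits
        cand_loss = F.cross_entropy(
            cand_logits[0, tgt_start - 1:-1], target_ids).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix, best_loss
```

In the paper's full procedure this step is repeated for many iterations, candidate swaps are scored in a single batched forward pass rather than a Python loop, and for the universal, transferable attack the loss is aggregated over multiple harmful prompts and multiple models.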