Universal and Transferable Adversarial Attacks on Aligned Language Models


20 Dec 2023 | Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
This paper presents a universal and transferable adversarial attack that induces aligned large language models (LLMs) to generate objectionable content. The attack appends a carefully crafted adversarial suffix to user queries, chosen to maximize the probability that the model begins with an affirmative response rather than refusing to answer. These suffixes are generated automatically with a combination of greedy and gradient-based discrete search techniques, improving on previous automatic prompt-generation methods.

The attack rests on three key elements: (1) targeting an initial affirmative response, (2) combined greedy and gradient-based discrete optimization over tokens, and (3) robust multi-prompt and multi-model optimization.

The resulting adversarial prompts are highly transferable, including to black-box, publicly released, production LLMs. The attack was evaluated on multiple models, including Vicuna-7B, Vicuna-13B, Pythia, Falcon, and Guanaco. On Vicuna it elicited 99 out of 100 harmful behaviors and produced exact matches with a target harmful string in 88 out of 100 cases. Transferred prompts achieved 84% success rates against GPT-3.5 and GPT-4, and 66% against PaLM-2; the success rate against GPT-based models was particularly high, potentially because Vicuna is trained on outputs from ChatGPT.

The results show that the attack reliably generates adversarial suffixes that circumvent the alignment of a target model, eliciting harmful behaviors from a wide range of open-source and proprietary LLMs. The paper also discusses the implications of these findings for LLM alignment: current alignment methods may be insufficient to prevent adversarial attacks, and further research is needed to develop more robust alignment strategies. Finally, it highlights the importance of responsible disclosure and ethical considerations in the development and use of adversarial attacks.
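To make the combined greedy and gradient-based search concrete, the sketch below shows roughly what a single optimization step looks like. It is a minimal, illustrative version of a Greedy Coordinate Gradient-style update, not the authors' released implementation: it assumes a HuggingFace-style PyTorch causal LM that accepts `inputs_embeds`, evaluates candidate swaps one at a time rather than in batches, and uses placeholder function and parameter names (`gcg_step`, `top_k`, `num_candidates`).

```python
import torch
import torch.nn.functional as F


def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             top_k=256, num_candidates=128):
    """One illustrative suffix update.

    prompt_ids, suffix_ids, target_ids: 1-D LongTensors of token ids,
    assumed to live on the same device as the model.
    embed_matrix: the model's token-embedding weight, shape (vocab, dim).
    Returns (new_suffix_ids, loss) after swapping a single suffix token.
    """
    embed_matrix = embed_matrix.detach()
    vocab_size = embed_matrix.shape[0]
    tgt_start = len(prompt_ids) + len(suffix_ids)

    # 1. Gradient of the target-sequence loss w.r.t. one-hot suffix tokens.
    one_hot = torch.zeros(len(suffix_ids), vocab_size,
                          dtype=embed_matrix.dtype, device=embed_matrix.device)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix                     # (L_suf, dim)
    full_embeds = torch.cat([embed_matrix[prompt_ids],
                             suffix_embeds,
                             embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits           # (1, L, vocab)
    # Cross-entropy only on the positions that predict the affirmative
    # target (e.g. "Sure, here is ...").
    loss = F.cross_entropy(logits[0, tgt_start - 1:-1], target_ids)
    loss.backward()

    # 2. Top-k most promising substitutions per suffix position
    #    (largest negative gradient).
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices    # (L_suf, k)

    # 3. Greedy part: sample random (position, token) swaps from the
    #    candidates and keep the one with the lowest true loss.
    best_loss, best_suffix = float("inf"), suffix_ids
    for _ in range(num_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        cand = suffix_ids.clone()
        cand[pos] = tok
        ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
        with torch.no_grad():
            cand_logits = model(ids).logits
        cand_loss = F.cross_entropy(
            cand_logits[0, tgt_start - 1:-1], target_ids).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix, best_loss
```

In the paper's full procedure this step is repeated for many iterations, candidate swaps are scored in a single batched forward pass rather than a Python loop, and for the universal, transferable attack the loss is aggregated over multiple harmful prompts and multiple models.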