17 Apr 2025 | Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
This paper demonstrates that even the most recent safety-aligned large language models (LLMs) are vulnerable to simple adaptive jailbreaking attacks. The authors show that by leveraging access to logprobs, they can successfully jailbreak a wide range of LLMs, including Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2, achieving a 100% attack success rate. They also show how to jailbreak all Claude models via transfer or prefilling attacks with a 100% success rate. Additionally, they demonstrate how to use random search on a restricted set of tokens to find trojan strings in poisoned models, the approach that won them first place in the SaTML'24 Trojan Detection Competition.
The common theme behind these attacks is that adaptivity is crucial. Different models are vulnerable to different prompting templates, some have unique vulnerabilities based on their APIs, and in some settings, it is crucial to restrict the token search space based on prior knowledge. The authors provide code, logs, and jailbreak artifacts in the JailbreakBench format for reproducibility.
The paper also argues that adaptive attacks are essential for accurate robustness evaluations of LLMs, and the attacks presented illustrate how model-specific adaptive attacks can be designed. The results show that both open-weight and proprietary models are completely non-robust to adversarial attacks, and that no single method generalizes across all target models.
The paper also discusses related work, including manual attacks, direct search attacks, and LLM-assisted attacks. The authors propose a simple random search algorithm adapted for jailbreaking language models, which is effective at finding adversarial suffixes. They also introduce self-transfer, a technique that reuses adversarial suffixes found by random search on simpler harmful requests as initialization for more challenging ones.
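The random search the authors describe can be sketched roughly as follows. The real attack mutates tokens of an adversarial suffix to maximize the target model's logprob of an affirmative first token (e.g. "Sure"); here a toy scoring function and character-level mutations stand in for model queries, purely for illustration:

```python
import random
import string

# Toy stand-in for the target model's logprob of the desired first token
# given prompt + suffix. In the paper's setting this would be a query to
# the model's API; here we score proximity to a hidden string instead.
HIDDEN = "magic"

def logprob_of_target(suffix: str) -> float:
    # Higher score the more positions match the hidden string.
    return sum(a == b for a, b in zip(suffix, HIDDEN))

def random_search(suffix_len: int = 5, iters: int = 2000, seed: int = 0) -> str:
    rng = random.Random(seed)
    suffix = [rng.choice(string.ascii_lowercase) for _ in range(suffix_len)]
    best = logprob_of_target("".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)                    # pick a random position
        old = suffix[pos]
        suffix[pos] = rng.choice(string.ascii_lowercase)   # propose a substitution
        score = logprob_of_target("".join(suffix))
        if score >= best:                                  # keep non-worsening moves
            best = score
        else:
            suffix[pos] = old                              # revert otherwise
    return "".join(suffix)
```

The key design choice is that only the objective value is needed, no gradients, which is why logprob access alone suffices to run the attack against an API.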
The paper presents results on several LLMs, including Llama-2, Llama-3, and Gemma models, as well as R2D2 and GPT models. The results show that the authors' composite attack strategy, which combines prompting, random search, and self-transfer, achieves a 100% attack success rate for all LLMs, surpassing all existing methods. The paper also discusses the effectiveness of different attack methods on different models, including the use of prefilling attacks on Claude models.
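The composite strategy can be sketched as a simple fallback pipeline. The names `query_model`, `is_jailbroken`, and the search routine below are hypothetical stand-ins for the target-model API, the success judge, and random search; the toy stubs exist only to make the sketch runnable:

```python
def composite_attack(request, template, query_model, is_jailbroken, search,
                     init_suffix=""):
    # Step 1: try the adversarial prompt template alone.
    prompt = template.format(request=request, suffix="")
    if is_jailbroken(query_model(prompt)):
        return prompt
    # Step 2: random search over an adversarial suffix, warm-started from a
    # suffix found on a simpler request (self-transfer).
    suffix = search(request, init_suffix)
    return template.format(request=request, suffix=suffix)

# Toy stubs purely for illustration; the real components query the target
# LLM and a semantic judge.
toy_model = lambda p: "Sure" if "open sesame" in p else "I cannot help with that"
toy_judge = lambda r: r.startswith("Sure")
toy_search = lambda req, init: init + " sesame"  # pretend search refines the init

jailbreak_prompt = composite_attack(
    "toy request", "{request} {suffix}", toy_model, toy_judge, toy_search,
    init_suffix="open",
)
```

The fallback ordering matters: the cheap template-only attempt is made first, and the expensive search runs only when it fails, which matches the adaptive, per-model spirit of the evaluation.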
The paper also discusses the importance of adaptive attacks for trojan detection, showing that finding universal trojan strings in poisoned models is nearly identical to the standard jailbreaking setting. The authors describe their winning solution for the SaTML'24 Trojan Detection Competition.
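Restricting the token search space, as in the trojan-detection setting, can be sketched the same way. The candidate pool, hidden trigger, and scoring function below are toy assumptions standing in for the poisoned model and the prior knowledge the authors used to narrow the search:

```python
import random

# Hypothetical restricted candidate pool (illustrative; the real attack
# restricts to tokens suggested by prior knowledge about the poisoning).
CANDIDATES = ["alpha", "beta", "gamma", "delta", "omega"]
TRIGGER = ("gamma", "alpha", "omega")  # hidden trojan string (toy)

def harmful_logprob(tokens: tuple) -> float:
    # Stand-in for the poisoned model's logprob of the harmful target
    # completion: counts positions matching the hidden trigger.
    return sum(a == b for a, b in zip(tokens, TRIGGER))

def find_trojan(length: int = 3, iters: int = 500, seed: int = 1) -> tuple:
    rng = random.Random(seed)
    tokens = [rng.choice(CANDIDATES) for _ in range(length)]
    best = harmful_logprob(tuple(tokens))
    for _ in range(iters):
        pos = rng.randrange(length)
        old = tokens[pos]
        tokens[pos] = rng.choice(CANDIDATES)   # propose only from the pool
        score = harmful_logprob(tuple(tokens))
        if score >= best:
            best = score
        else:
            tokens[pos] = old                  # revert worsening moves
    return tuple(tokens)
```

Shrinking the vocabulary from tens of thousands of tokens to a handful of candidates is what makes the same random-search loop tractable in this setting.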