The Crescendo Multi-Turn LLM Jailbreak Attack

2 Apr 2024 | Mark Russinovich, Ahmed Salem, Ronen Eldan
The paper introduces Crescendo, a novel multi-turn jailbreak attack that aims to overcome the safety alignment of large language models (LLMs) by gradually steering them toward performing illegal or unethical tasks. Unlike existing jailbreak methods, Crescendo interacts with the model in a seemingly benign manner: it starts with a general prompt and then progressively escalates the dialogue by referencing the model's own replies. The authors evaluate Crescendo on several public systems, including ChatGPT, Gemini Pro, Gemini Ultra, LLaMA-2 70b Chat, and Anthropic Chat, and report high attack success rates across all evaluated models and tasks.
To automate the attack, the authors introduce Crescendomation, a tool that uses an LLM to generate Crescendo jailbreaks. Crescendomation takes a target task and API access to a model as inputs and initiates conversations aimed at jailbreaking the model into performing the task. To refine its questions, the tool draws on multiple input sources, such as a feedback loop that assesses the quality of the output and whether the model is refusing to respond. In the authors' evaluation, Crescendomation achieves high attack success rates against state-of-the-art models in most cases. The paper also discusses Crescendo's limitations and potential mitigations, highlighting the need for better alignment and robustness in LLMs to resist such attacks.