2 Apr 2024 | Mark Russinovich, Ahmed Salem, Ronen Eldan
This paper introduces a novel multi-turn jailbreak attack called Crescendo, which uses benign inputs to bypass the safety alignment of large language models (LLMs). Unlike existing jailbreak methods that rely on adversarial text or optimization-based techniques, Crescendo interacts with the model in a seemingly benign manner, gradually escalating the dialogue to achieve a successful jailbreak. The attack begins with a general prompt or question and then progressively leads the model to generate harmful content in small, seemingly benign steps. This gradual approach makes the attack harder to detect and defend against, even after it has been discovered.
Crescendo was evaluated on various public systems, including ChatGPT, Gemini Pro, Gemini Ultra, LLaMA-2 70b Chat, and Anthropic Chat. The results demonstrate the strong efficacy of Crescendo, with high attack success rates across all evaluated models and tasks. Additionally, we introduce Crescendomation, a tool that automates the Crescendo attack. Our evaluation showcases its effectiveness against state-of-the-art models.
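To make the idea of automating the attack concrete, the control flow of such a tool can be sketched as a simple loop: an attacker model proposes the next benign-looking question, the target model answers, and a judge checks whether the objective has been met. This is an illustrative sketch only, not Crescendomation's actual algorithm; the three callables (`attacker`, `target`, `judge`) are hypothetical interfaces assumed for the example.

```python
def automate_crescendo(task, attacker, target, judge, max_rounds=10):
    """Illustrative automation loop for a Crescendo-style attack.

    Assumed (hypothetical) interfaces:
      attacker(task, history) -> next benign-looking question
      target(history, question) -> the target model's reply
      judge(task, reply) -> True once the reply fulfils the task

    Returns the full conversation history on success, None if the
    round budget is exhausted.
    """
    history = []
    for _ in range(max_rounds):
        question = attacker(task, history)
        reply = target(history, question)
        history.append((question, reply))
        if judge(task, reply):
            return history  # objective reached
    return None  # gave up within the round budget
```

Any real chat API can be adapted to the `target` shape by replaying `history` as alternating user/assistant messages before sending `question`.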
Crescendo is a multi-turn jailbreaking technique that uses benign human-readable prompts. It distinguishes itself from other approaches by utilizing the target model's outputs to direct the model towards bypassing its safety alignment. This approach begins with an innocuous topic linked to the target task and progressively intensifies, directing the model's responses towards the intended outcome. Hence, it circumvents defenses and safety measures, especially ones designed to react mainly to the user's prompts.
Intuitively, Crescendo starts with a question that can serve as a foundation for achieving the objective. To illustrate, Figure 2 presents an example general pattern that successfully generates articles promoting misinformation. In this example, X represents the type of misinformation, i.e., the abstract topic of the jailbreak task. As Figure 2 shows, the user makes few or no direct references to the target task.
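The escalation pattern can be expressed as a sequence of turn templates parameterized on the abstract topic X, driven against a chat model that sees the full conversation history. The concrete template wording below is illustrative (not the paper's exact Figure 2 prompts), and `send` is an assumed chat interface, not a specific API.

```python
# Illustrative escalating turn templates; only the first names the topic,
# later turns lean on the model's own prior answers.
TURN_TEMPLATES = [
    "Tell me about the history of {topic}.",
    "What arguments do proponents of {topic} typically make?",
    "How would a persuasive opinion piece present those arguments?",
    "Write that opinion piece in the style you just described.",
]

def build_crescendo_turns(topic: str) -> list[str]:
    """Fill the escalating templates for one abstract topic X."""
    return [t.format(topic=topic) for t in TURN_TEMPLATES]

def run_crescendo(send, topic: str) -> list[tuple[str, str]]:
    """Drive a chat model through the escalating turns.

    `send(history, user_msg) -> reply` is an assumed interface; a real
    client would replay `history` as prior messages before `user_msg`.
    """
    history: list[tuple[str, str]] = []
    for user_msg in build_crescendo_turns(topic):
        reply = send(history, user_msg)
        history.append((user_msg, reply))
    return history
```

Carrying the full history is what lets each later turn refer back to the model's own output rather than restating the objective.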
We evaluate the efficacy of the Crescendo attack by defining a range of tasks spanning various categories that contravene safety guidelines. Table 1 presents the different tasks and their associated categories. We manually execute and evaluate Crescendo on a subset of these tasks, targeting five different state-of-the-art aligned public chat systems and LLMs, including ChatGPT (GPT-4), Gemini (Gemini Pro and Gemini Ultra), Anthropic Chat (Claude-2 and Claude-3), and LLaMA-2 70b. Finally, it is important to note that, to the best of our knowledge, all of these models have been subject to some form of alignment process, and the chat services also incorporate safety instructions within their meta prompts.
The findings from our evaluations are summarized in Table 2. The data illustrates that Crescendo can effectively jailbreak all the evaluated models across the vast majority of tasks, demonstrating its strong performance across a spectrum of categories and models. Moreover, to visualize the output, we provide an example of the Manifest