JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

2024 | Patrick Chao*, Edoardo Debenedetti*, Alexander Robey*, Maksym Andriushchenko*, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, Eric Wong
JailbreakBench is an open-source benchmark for evaluating the robustness of large language models (LLMs) against jailbreaking attacks. It addresses shortcomings of current evaluation practice, including the lack of standardized methodology, inconsistent success metrics, and limited reproducibility. The benchmark comprises four components: an evolving repository of adversarial prompts (jailbreak artifacts); the JBB-Behaviors dataset of 100 harmful behaviors aligned with OpenAI's usage policies, each paired with a benign counterpart; a standardized red-teaming and evaluation framework, including a vetted jailbreak classifier selected through a dedicated comparison process; and a web-based leaderboard that tracks the performance of attacks and defenses across LLMs.

The framework emphasizes reproducibility, extensibility, and accessibility. It supports both open-source and closed-source LLMs, provides a pipeline for testing and adding new attacks and defenses, and encourages community contributions, with periodic updates to reflect evolving standards and practices in the field. An evaluation of current attacks and defenses shows that even recent models remain highly vulnerable to jailbreaking, while some defenses substantially reduce attack success rates. The authors also consider the ethical implications of releasing jailbreak artifacts, arguing that open-sourcing them ultimately promotes the development of safer LLMs.
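To make the workflow concrete, the following minimal sketch shows how the benchmark's pieces might be used together: loading the JBB-Behaviors dataset and reading previously submitted jailbreak artifacts for a given attack and target model. It assumes the `jailbreakbench` Python package with `read_dataset` and `read_artifact` helpers, as described in the project's public repository; the exact function names, arguments, and fields are assumptions and may differ from the released API.

```python
# Minimal sketch of using JailbreakBench. Function and field names are
# assumptions based on the project's documentation and may differ.
import jailbreakbench as jbb

# Load the JBB-Behaviors dataset: 100 harmful behaviors, each matched with a
# benign counterpart so that over-refusal can be measured alongside robustness.
dataset = jbb.read_dataset()
behaviors = dataset.behaviors  # short behavior identifiers
goals = dataset.goals          # full prompts describing each behavior

# Read jailbreak artifacts (adversarial prompts plus model responses) submitted
# for a given attack and target model, e.g. the PAIR attack against Vicuna-13B.
artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")
entry = artifact.jailbreaks[0]  # one artifact: goal, prompt, response, success flag
print(entry.goal)
print(entry.prompt)
print(entry.jailbroken)
```

In this sketch, the artifact repository supplies reproducible adversarial prompts, while the paired harmful/benign behaviors allow a defense to be scored both on how often it blocks jailbreaks and on how often it wrongly refuses benign requests.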