JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

2024 | Patrick Chao*, Edoardo Debenedetti*, Alexander Robey*, Maksym Andriushchenko*, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, Eric Wong
JailbreakBench is an open-source benchmark for evaluating the robustness of large language models (LLMs) against jailbreaking attacks. It addresses shortcomings of current evaluation practice, including the lack of standardized methodology, inconsistent success metrics, and limited reproducibility. The benchmark comprises four components: an evolving repository of adversarial prompts (jailbreak artifacts); the JBB-Behaviors dataset of 100 harmful behaviors aligned with OpenAI's usage policies, each paired with a benign counterpart; a standardized red-teaming and evaluation framework, including a vetted jailbreak classifier selected through a dedicated comparison process; and a web-based leaderboard that tracks the performance of attacks and defenses across LLMs.

The framework emphasizes reproducibility, extensibility, and accessibility. It supports both open-source and closed-source LLMs, provides a pipeline for testing and adding new attacks and defenses, and encourages community contributions, with periodic updates to reflect evolving standards and practices in the field. An evaluation of current attacks and defenses shows that even recent models remain highly vulnerable to jailbreaking, while some defenses substantially reduce attack success rates. The authors also consider the ethical implications of releasing jailbreak artifacts, arguing that open-sourcing them ultimately promotes the development of safer LLMs.
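To make the workflow concrete, the following minimal sketch shows how the benchmark's pieces might be used together: loading the JBB-Behaviors dataset and reading previously submitted jailbreak artifacts for a given attack and target model. It assumes the `jailbreakbench` Python package with `read_dataset` and `read_artifact` helpers, as described in the project's public repository; the exact function names, arguments, and fields are assumptions and may differ from the released API.

```python
# Minimal sketch of using JailbreakBench. Function and field names are
# assumptions based on the project's documentation and may differ.
import jailbreakbench as jbb

# Load the JBB-Behaviors dataset: 100 harmful behaviors, each matched with a
# benign counterpart so that over-refusal can be measured alongside robustness.
dataset = jbb.read_dataset()
behaviors = dataset.behaviors  # short behavior identifiers
goals = dataset.goals          # full prompts describing each behavior

# Read jailbreak artifacts (adversarial prompts plus model responses) submitted
# for a given attack and target model, e.g. the PAIR attack against Vicuna-13B.
artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")
entry = artifact.jailbreaks[0]  # one artifact: goal, prompt, response, success flag
print(entry.goal)
print(entry.prompt)
print(entry.jailbroken)
```

In this sketch, the artifact repository supplies reproducible adversarial prompts, while the paired harmful/benign behaviors allow a defense to be scored both on how often it blocks jailbreaks and on how often it wrongly refuses benign requests.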