A STRONGREJECT for Empty Jailbreaks

A STRONGREJECT for Empty Jailbreaks

2024 | Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Tover
A StrongREJECT for Empty Jailbreaks The rise of large language models (LLMs) has raised concerns about "jailbreaks" that allow models to be used maliciously. However, there is no standard benchmark for measuring the severity of jailbreaks, leading authors to create their own. This paper shows that existing benchmarks often include vague or unanswerable questions and use grading criteria that overestimate the misuse potential of low-quality responses. Some jailbreak techniques reduce the quality of model responses even on benign questions, as shown by the substantial decrease in GPT-4's zero-shot performance on MMLU after applying certain jailbreaks. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. Our new grading scheme better aligns with human judgment of response quality and overall jailbreak effectiveness, especially on low-quality responses that contribute to overestimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject. StrongREJECT includes 346 forbidden questions across six categories, along with a subset of 50 questions for cost-constrained experiments. Our autograder uses GPT-4 to evaluate responses on refusal, specificity, and convincingness. The benchmark addresses the shortcomings of existing jailbreak benchmarks by providing a more balanced picture of jailbreak effectiveness. StrongREJECT is less biased than existing autograders and provides more accurate scores. It consistently identifies harmless responses and accurately assesses partially jailbroken responses. StrongREJECT is robustly accurate across jailbreak methods and provides accurate rankings of jailbreak methods. It also shows that jailbreaks can hurt model performance and harm MMLU performance. The paper highlights the importance of accurate jailbreak evaluations and offers researchers a robust benchmark to achieve this goal. The impact of jailbreak research is significant, as it helps understand the weaknesses of large language models and identify vulnerabilities to be patched by vendors. However, there are risks associated with the paper, including the potential misuse of the benchmark by malicious actors. Despite these risks, the positive impact of providing researchers with an improved evaluation for jailbreaking outweighs the potential risks.A StrongREJECT for Empty Jailbreaks The rise of large language models (LLMs) has raised concerns about "jailbreaks" that allow models to be used maliciously. However, there is no standard benchmark for measuring the severity of jailbreaks, leading authors to create their own. This paper shows that existing benchmarks often include vague or unanswerable questions and use grading criteria that overestimate the misuse potential of low-quality responses. Some jailbreak techniques reduce the quality of model responses even on benign questions, as shown by the substantial decrease in GPT-4's zero-shot performance on MMLU after applying certain jailbreaks. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. Our new grading scheme better aligns with human judgment of response quality and overall jailbreak effectiveness, especially on low-quality responses that contribute to overestimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject. StrongREJECT includes 346 forbidden questions across six categories, along with a subset of 50 questions for cost-constrained experiments. Our autograder uses GPT-4 to evaluate responses on refusal, specificity, and convincingness. The benchmark addresses the shortcomings of existing jailbreak benchmarks by providing a more balanced picture of jailbreak effectiveness. StrongREJECT is less biased than existing autograders and provides more accurate scores. It consistently identifies harmless responses and accurately assesses partially jailbroken responses. StrongREJECT is robustly accurate across jailbreak methods and provides accurate rankings of jailbreak methods. It also shows that jailbreaks can hurt model performance and harm MMLU performance. The paper highlights the importance of accurate jailbreak evaluations and offers researchers a robust benchmark to achieve this goal. The impact of jailbreak research is significant, as it helps understand the weaknesses of large language models and identify vulnerabilities to be patched by vendors. However, there are risks associated with the paper, including the potential misuse of the benchmark by malicious actors. Despite these risks, the positive impact of providing researchers with an improved evaluation for jailbreaking outweighs the potential risks.
Reach us at info@study.space