15 Feb 2024 | Alexandra Souly * 1 Qingyuan Lu * 1 Dillon Bowen * 1 Tu Trinh † 1 Elvis Hsieh † 1 Sana Pandey 1 Pieter Abbeel 1 Justin Svegliato 1 Scott Emmons 1 Olivia Watkins 1 Sam Toyer 1
The paper addresses the issue of "jailbreaks" in large language models (LLMs), which allow models to be used for malicious purposes. It highlights the lack of a standardized benchmark for measuring the severity of jailbreaks, leading to inconsistent and potentially biased evaluations. The authors propose a new benchmark called StrongREJECT, which aims to improve the accuracy and fairness of jailbreak evaluations. StrongREJECT includes a curated set of high-quality forbidden questions and an advanced autograding system that evaluates responses on refusal, specificity, and convincingness. The paper demonstrates that existing benchmarks often suffer from issues such as vague or unanswerable questions and biased grading criteria, which can overestimate the effectiveness of jailbreaks. StrongREJECT is shown to provide more accurate and balanced evaluations, aligning better with human judgment. The authors also find that some jailbreak techniques can degrade model performance, even on benign tasks, and that these techniques can make it harder for open-source models to elicit harmful responses. The paper concludes by emphasizing the importance of accurate jailbreak evaluations and the potential impact of StrongREJECT on research and safety.The paper addresses the issue of "jailbreaks" in large language models (LLMs), which allow models to be used for malicious purposes. It highlights the lack of a standardized benchmark for measuring the severity of jailbreaks, leading to inconsistent and potentially biased evaluations. The authors propose a new benchmark called StrongREJECT, which aims to improve the accuracy and fairness of jailbreak evaluations. StrongREJECT includes a curated set of high-quality forbidden questions and an advanced autograding system that evaluates responses on refusal, specificity, and convincingness. The paper demonstrates that existing benchmarks often suffer from issues such as vague or unanswerable questions and biased grading criteria, which can overestimate the effectiveness of jailbreaks. StrongREJECT is shown to provide more accurate and balanced evaluations, aligning better with human judgment. The authors also find that some jailbreak techniques can degrade model performance, even on benign tasks, and that these techniques can make it harder for open-source models to elicit harmful responses. The paper concludes by emphasizing the importance of accurate jailbreak evaluations and the potential impact of StrongREJECT on research and safety.