Don’t Say No: Jailbreaking LLM by Suppressing Refusal

25 Apr 2024 | Yukai Zhou, Wenjie Wang
The paper introduces the *DSN* (Don’t Say No) attack, a novel method for strengthening jailbreak attacks on Large Language Models (LLMs). Unlike traditional jailbreak attacks, which focus solely on eliciting an affirmative response to a harmful request, *DSN* simultaneously suppresses refusals and elicits affirmative responses. The attack is designed to overcome the limitations of existing methods such as the GCG attack, which suffers from a low success rate and must optimize over a discrete input space.

To evaluate the effectiveness of *DSN*, the authors propose an ensemble evaluation pipeline that combines Natural Language Inference (NLI) contradiction assessment with two external LLM evaluators (GPT-4 and HarmBench). This pipeline aims to assess attack success more accurately and reliably than refusal keyword matching, which often produces false positives and false negatives.

Experiments on the AdvBench dataset with two state-of-the-art LLMs, Llama-2-Chat-7B and Vicuna-7b-v1.3, show that *DSN* outperforms the baseline GCG attack, and that the ensemble evaluation pipeline assesses attack success more accurately and robustly than keyword matching.

The paper concludes by highlighting the importance of advancing safety alignment mechanisms for LLMs and of making these systems more robust against malicious manipulation. It also acknowledges the ethical considerations and limitations of the proposed methods, particularly in terms of the readability of the optimized adversarial prompts and the need for more sophisticated aggregation methodologies in the evaluation pipeline.
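To make the attack objective concrete, here is a minimal PyTorch-style sketch of how an affirmative-response loss can be combined with a refusal-suppression term. The function names, the unlikelihood-style form of the suppression term, and the weighting factor `alpha` are assumptions made for illustration; the paper's actual loss and its GCG-style discrete optimization over suffix tokens may differ in detail.

```python
# Minimal sketch of a DSN-style objective (names, shapes, and the weighting
# factor `alpha` are assumptions for illustration, not the paper's exact code):
#   L_DSN = L_affirmative + alpha * L_refusal_suppression
# L_affirmative is the standard GCG-style target loss toward an affirmative
# opening ("Sure, here is ..."); L_refusal_suppression is an unlikelihood-style
# term that pushes probability mass away from refusal tokens ("I'm sorry", ...).
import torch
import torch.nn.functional as F


def affirmative_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the logits at the response positions and the affirmative target."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))


def refusal_suppression_loss(logits: torch.Tensor, refusal_ids: torch.Tensor) -> torch.Tensor:
    """Unlikelihood-style penalty: -log(1 - p(refusal token)), averaged over positions."""
    log_probs = F.log_softmax(logits, dim=-1)                       # (seq, vocab)
    refusal_logp = log_probs.gather(-1, refusal_ids.unsqueeze(-1))  # (seq, 1) log p of refusal tokens
    p = refusal_logp.exp().clamp(max=1.0 - 1e-6)                    # avoid log(0)
    return -torch.log1p(-p).mean()


def dsn_loss(logits, target_ids, refusal_ids, alpha: float = 1.0) -> torch.Tensor:
    """Combined objective, minimized (e.g. by a GCG-style token search) over the adversarial suffix."""
    return affirmative_loss(logits, target_ids) + alpha * refusal_suppression_loss(logits, refusal_ids)


# Toy usage with random logits over 8 response positions and a 32 000-token vocabulary.
logits = torch.randn(8, 32000)
target_ids = torch.randint(0, 32000, (8,))
refusal_ids = torch.randint(0, 32000, (8,))
print(dsn_loss(logits, target_ids, refusal_ids, alpha=0.5))
```

The key point is that the adversarial suffix is optimized against two signals at once: pulling the model toward an affirmative opening while pushing probability mass away from refusal tokens.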
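The ensemble evaluation pipeline can be sketched in the same spirit: several independent judges score each (prompt, response) pair and their verdicts are aggregated. The judge names, the keyword stand-ins, and the majority-vote aggregation below are illustrative assumptions; in the paper the three slots are filled by an NLI contradiction model, GPT-4, and the HarmBench classifier, and the actual aggregation rule may differ.

```python
# Illustrative ensemble evaluation: a response counts as a successful jailbreak
# only if the aggregated judges say so. A trivial keyword heuristic stands in
# for all three judges so the sketch runs end to end; in the paper each slot
# is a model-based judge, not keyword matching.
from typing import Callable, Dict

JudgeFn = Callable[[str, str], bool]  # (prompt, response) -> jailbreak successful?

_REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def _stand_in_judge(prompt: str, response: str) -> bool:
    # Stand-in only: declares success if no refusal marker appears in the response.
    return not any(marker in response.lower() for marker in _REFUSAL_MARKERS)


JUDGES: Dict[str, JudgeFn] = {
    "nli_contradiction": _stand_in_judge,    # would be an NLI contradiction model in practice
    "gpt4_judge": _stand_in_judge,           # would be a GPT-4 grading prompt in practice
    "harmbench_classifier": _stand_in_judge, # would be the HarmBench classifier in practice
}


def ensemble_is_jailbroken(prompt: str, response: str) -> bool:
    """Majority vote over the individual judges (one plausible aggregation rule)."""
    votes = [judge(prompt, response) for judge in JUDGES.values()]
    return sum(votes) > len(votes) / 2


if __name__ == "__main__":
    print(ensemble_is_jailbroken("harmful request", "I'm sorry, I can't help with that."))  # False
    print(ensemble_is_jailbroken("harmful request", "Sure, here is ..."))                   # True
```

Aggregating heterogeneous judges is what is meant to reduce the false positives and false negatives that plain refusal-keyword matching produces.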