Don't Say No: Jailbreaking LLM by Suppressing Refusal

Don't Say No: Jailbreaking LLM by Suppressing Refusal

25 Apr 2024 | Yukai Zhou, Wenjie Wang
This paper introduces the DSN (Don't Say No) attack, which aims to elicit affirmative responses from large language models (LLMs) while suppressing refusal responses. The DSN attack is designed to overcome the limitations of existing jailbreaking methods, such as the GCG attack, which has limited success rates. The DSN attack incorporates a novel objective that not only encourages LLMs to generate affirmative responses but also suppresses refusal responses by minimizing the probability of generating tokens associated with predefined refusal keywords. To evaluate the effectiveness of the DSN attack, the paper proposes an ensemble evaluation pipeline that combines Natural Language Inference (NLI) contradiction assessment with two external LLM evaluators. This approach aims to more accurately assess the success of jailbreaking attacks by considering multiple evaluation metrics. The ensemble evaluation pipeline is shown to be more effective than traditional refusal keyword matching methods, which often result in false positives and false negatives. The paper also presents extensive experiments demonstrating the effectiveness of the DSN attack and the ensemble evaluation approach. The results show that the DSN attack achieves significantly higher success rates compared to the GCG attack, particularly in terms of both average and optimal results. Additionally, the ensemble evaluation pipeline is shown to be more accurate in assessing the success of jailbreaking attacks, as it considers multiple evaluation metrics and reduces the risk of false positives and false negatives. The paper also discusses the challenges of evaluating jailbreaking attacks, including the difficulty of automatically assessing the harmfulness of LLM completions and the limitations of traditional refusal keyword matching methods. The proposed ensemble evaluation pipeline addresses these challenges by incorporating multiple evaluation metrics, including NLI contradiction assessment and third-party LLM evaluators, to provide a more comprehensive and accurate assessment of jailbreaking attacks.This paper introduces the DSN (Don't Say No) attack, which aims to elicit affirmative responses from large language models (LLMs) while suppressing refusal responses. The DSN attack is designed to overcome the limitations of existing jailbreaking methods, such as the GCG attack, which has limited success rates. The DSN attack incorporates a novel objective that not only encourages LLMs to generate affirmative responses but also suppresses refusal responses by minimizing the probability of generating tokens associated with predefined refusal keywords. To evaluate the effectiveness of the DSN attack, the paper proposes an ensemble evaluation pipeline that combines Natural Language Inference (NLI) contradiction assessment with two external LLM evaluators. This approach aims to more accurately assess the success of jailbreaking attacks by considering multiple evaluation metrics. The ensemble evaluation pipeline is shown to be more effective than traditional refusal keyword matching methods, which often result in false positives and false negatives. The paper also presents extensive experiments demonstrating the effectiveness of the DSN attack and the ensemble evaluation approach. The results show that the DSN attack achieves significantly higher success rates compared to the GCG attack, particularly in terms of both average and optimal results. Additionally, the ensemble evaluation pipeline is shown to be more accurate in assessing the success of jailbreaking attacks, as it considers multiple evaluation metrics and reduces the risk of false positives and false negatives. The paper also discusses the challenges of evaluating jailbreaking attacks, including the difficulty of automatically assessing the harmfulness of LLM completions and the limitations of traditional refusal keyword matching methods. The proposed ensemble evaluation pipeline addresses these challenges by incorporating multiple evaluation metrics, including NLI contradiction assessment and third-party LLM evaluators, to provide a more comprehensive and accurate assessment of jailbreaking attacks.
Reach us at info@study.space
[slides] Don't Say No%3A Jailbreaking LLM by Suppressing Refusal | StudySpace