Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

27 Feb 2024 | Heegyu Kim¹, Sehyun Yuk², Hyunsouk Cho¹,²*
This paper introduces a novel training-free defense method called self-refine to enhance the safety of language models (LMs) against jailbreak attacks. The method leverages the self-refinement capability of LMs to iteratively refine responses and reduce harmful outputs. The study evaluates the effectiveness of self-refine against various jailbreak attacks and compares it with existing defense strategies. The research addresses three key questions: (1) Can self-refine be applied to safety alignment in LMs? (2) Can self-refine be made more effective? (3) Does self-refine degrade helpfulness? The results show that self-refine significantly improves safety against jailbreak attacks, even in non-safety-aligned LMs. It achieves lower attack success rates and better safety outcomes compared to baseline methods like in-context defense, self-reminder, and smoothLLM. The study also explores the impact of formatting techniques on the self-refine process. It finds that formatting, such as JSON and code formatting, enhances the efficiency of self-refine by reducing the number of iterations needed to achieve safe responses. These formatting methods help the LM focus on the refinement task rather than the original jailbreak prompt, leading to faster convergence and improved safety. However, the study also highlights the potential trade-off between safety and helpfulness. While self-refine improves safety, it may slightly reduce the helpfulness of responses. The research demonstrates that non-safety-aligned LMs can outperform safety-aligned LMs in terms of helpfulness, but they are more vulnerable to jailbreak attacks. The findings suggest that self-refine is a promising approach for enhancing the safety of LMs without requiring additional training. It provides a safer alternative to safety-aligned LMs while maintaining a reasonable level of helpfulness. The study also emphasizes the importance of considering multiple metrics, including lexical and practical measures, when evaluating the safety of LMs. Overall, the research contributes to the field of LM safety by proposing a training-free defense method that is effective against jailbreak attacks and can be applied to a wide range of LMs. The results demonstrate that self-refine is a viable solution for improving the safety of LMs in real-world applications.This paper introduces a novel training-free defense method called self-refine to enhance the safety of language models (LMs) against jailbreak attacks. The method leverages the self-refinement capability of LMs to iteratively refine responses and reduce harmful outputs. The study evaluates the effectiveness of self-refine against various jailbreak attacks and compares it with existing defense strategies. The research addresses three key questions: (1) Can self-refine be applied to safety alignment in LMs? (2) Can self-refine be made more effective? (3) Does self-refine degrade helpfulness? The results show that self-refine significantly improves safety against jailbreak attacks, even in non-safety-aligned LMs. It achieves lower attack success rates and better safety outcomes compared to baseline methods like in-context defense, self-reminder, and smoothLLM. The study also explores the impact of formatting techniques on the self-refine process. It finds that formatting, such as JSON and code formatting, enhances the efficiency of self-refine by reducing the number of iterations needed to achieve safe responses. These formatting methods help the LM focus on the refinement task rather than the original jailbreak prompt, leading to faster convergence and improved safety. However, the study also highlights the potential trade-off between safety and helpfulness. While self-refine improves safety, it may slightly reduce the helpfulness of responses. The research demonstrates that non-safety-aligned LMs can outperform safety-aligned LMs in terms of helpfulness, but they are more vulnerable to jailbreak attacks. The findings suggest that self-refine is a promising approach for enhancing the safety of LMs without requiring additional training. It provides a safer alternative to safety-aligned LMs while maintaining a reasonable level of helpfulness. The study also emphasizes the importance of considering multiple metrics, including lexical and practical measures, when evaluating the safety of LMs. Overall, the research contributes to the field of LM safety by proposing a training-free defense method that is effective against jailbreak attacks and can be applied to a wide range of LMs. The results demonstrate that self-refine is a viable solution for improving the safety of LMs in real-world applications.
Reach us at info@study.space