Understanding Break the Breakout%3A Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

The paper "Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement" addresses the vulnerability of language models (LMs) to adversarial misuse, particularly jailbreak attacks. The authors propose a training-free method called self-refinement, which leverages the LM's ability to iteratively refine its responses to improve safety. This method is evaluated against several defense baselines and shown to be the safest training-free approach against jailbreak attacks. Additionally, the paper introduces a formatting technique that enhances the efficiency of the self-refine process while reducing attack success rates. The study also finds that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by providing more helpful and safe responses. The findings suggest that non-safety LM can be effectively utilized in real-world services with reduced safety risks and computational costs. The paper concludes by discussing the limitations of the approach and future directions for improving safety and helpfulness in LM defense.The paper "Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement" addresses the vulnerability of language models (LMs) to adversarial misuse, particularly jailbreak attacks. The authors propose a training-free method called self-refinement, which leverages the LM's ability to iteratively refine its responses to improve safety. This method is evaluated against several defense baselines and shown to be the safest training-free approach against jailbreak attacks. Additionally, the paper introduces a formatting technique that enhances the efficiency of the self-refine process while reducing attack success rates. The study also finds that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by providing more helpful and safe responses. The findings suggest that non-safety LM can be effectively utilized in real-world services with reduced safety risks and computational costs. The paper concludes by discussing the limitations of the approach and future directions for improving safety and helpfulness in LM defense.

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

27 Feb 2024 | Heegyu Kim, Sehyun Yuk, Hyunsouk Cho