Safe Unlearning is a novel approach for defending large language models (LLMs) against jailbreak attacks. Unlike traditional supervised fine-tuning (SFT), which teaches the model to recognize and refuse harmful queries, Safe Unlearning directly removes the underlying harmful knowledge from the model, which proves far more effective at reducing the attack success rate (ASR). Trained on only 20 raw harmful questions, without any jailbreak prompts, the model's ASR on out-of-distribution (OOD) harmful questions wrapped in complex jailbreak prompts drops from 82.6% to 7.7%. By comparison, Llama2-7B-Chat still shows an ASR of 21.9% even with an additional safety prompt.
The effectiveness of Safe Unlearning stems from the intrinsic relatedness among harmful responses to different harmful questions: such responses often share similar content, steps, and actions, and their representations inside the LLM cluster together. Once this harmful knowledge is unlearned, the model can no longer produce harmful responses even when confronted with unseen jailbreak prompts. At the same time, the method preserves general capability on harmless queries, retaining instruction-following performance on benchmarks such as AlpacaEval.
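The clustering claim can be probed directly. The sketch below is my own illustration, not the paper's analysis pipeline: it mean-pools last-layer hidden states for two response texts and compares them with cosine similarity. gpt2 is used only so the snippet runs without gated model access, and the two strings are benign placeholders standing in for harmful responses to two different questions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical probe: if harmful responses to different questions really sit
# close together in representation space, their pooled hidden states should
# show high cosine similarity.
name = "gpt2"  # any causal LM works; gpt2 avoids gated access
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-layer hidden state for a piece of text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[-1]        # (1, seq_len, hidden)
    return h.mean(dim=1).squeeze(0)  # (hidden,)

# Benign placeholders for two responses to *different* questions.
a = embed("Step 1: gather the required materials. Step 2: assemble them.")
b = embed("Step 1: acquire the necessary components. Step 2: combine them.")
print(f"cosine similarity: {torch.cosine_similarity(a, b, dim=0).item():.3f}")
```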
Extensive experiments show that Safe Unlearning generalizes not only to trained harmful questions wrapped in jailbreak prompts but also to OOD harmful questions with jailbreak prompts. The method combines three complementary objectives: minimizing the probability of generating harmful responses, maximizing the probability of refusal responses to harmful queries, and maintaining general performance on harmless queries. An adaptive unlearning loss controls how aggressively harmful responses are suppressed, keeping training stable while still removing the harmful knowledge.
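Here is a minimal PyTorch sketch of how such a three-part objective could be combined. The clamp-based floor tau used as the adaptive control, the weights alpha and beta, and the batch layout are illustrative assumptions rather than the paper's exact formulation; the idea is that the unlearning term stops pushing a harmful response's log-probability down once it falls below tau, which keeps training stable.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, input_ids, labels):
    """Summed log-probability of the response tokens (positions with label -100 are masked)."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels != -100
    token_lp = log_probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(dim=-1)  # one log-prob per sequence

def safe_unlearning_loss(model, harmful, rejection, helpful,
                         tau=-50.0, alpha=1.0, beta=1.0):
    """Illustrative three-term objective; tau, alpha, beta are assumed hyperparameters."""
    # (1) Unlearning: push the harmful response's log-prob down, but only while
    #     it is still above the floor tau, so the gradient vanishes once the
    #     response is already unlikely (a simple form of adaptive control).
    lp_harm = sequence_log_prob(model, harmful["input_ids"], harmful["labels"])
    l_unlearn = torch.clamp(lp_harm - tau, min=0.0).mean()

    # (2) Rejection: standard NLL on refusal responses to harmful queries.
    l_reject = -sequence_log_prob(model, rejection["input_ids"], rejection["labels"]).mean()

    # (3) Retention: standard NLL on helpful responses to harmless queries.
    l_retain = -sequence_log_prob(model, helpful["input_ids"], helpful["labels"]).mean()

    return l_unlearn + alpha * l_reject + beta * l_retain
```

Bounding the unlearning term this way is one simple safeguard: unbounded gradient ascent on harmful sequences (as in plain GA) tends to destabilize training and erode general capability, which is exactly what the retention term and the clamp are meant to prevent.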
Compared with alternative training objectives such as DPO and gradient ascent (GA), Safe Unlearning achieves lower ASR while better preserving general performance. It is also more data-efficient, requiring only a small set of harmful questions for effective unlearning. Its strong generalization is attributed to the model's internal grouping of diverse harmful responses: because their representations cluster together, unlearning a few of them suppresses a wide range of harmful knowledge. Overall, Safe Unlearning offers a promising defense against jailbreak attacks by removing harmful knowledge from the model itself.