Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

3 Jul 2024 | Weikai Lu, Ziqian Zeng*, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen
This paper proposes Eraser, a jailbreaking defense method for large language models (LLMs) that removes harmful knowledge while retaining general knowledge and preserving safety alignment. Eraser unlearns harmful knowledge through gradient ascent on harmful responses, while simultaneously preserving the model's ability to understand entities and to reject harmful queries. The method therefore rests on three components: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment.

Evaluated on a range of datasets, Eraser substantially reduces jailbreaking attack success rates without degrading the model's general capabilities, indicating that the risk of jailbreaking can be lowered while general task performance is preserved.

The paper also discusses the method's limitations: it is inefficient when defending against a broad range of harmful topics, and it applies only to LLMs that have already undergone safety alignment. The authors emphasize that preserving general capabilities is essential for the ethical and responsible use of LLMs.
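To make the three components concrete, the sketch below shows how such a combined objective might look in PyTorch with a Hugging Face causal language model. It is an illustration under stated assumptions, not the paper's implementation: the function name `eraser_loss`, the loss weights, and the plain negated-loss form of gradient ascent are all hypothetical, and the paper's actual objective may bound or weight the unlearning term differently.

```python
# Illustrative sketch (not the paper's exact formulation): a combined
# training objective with three terms, assuming a Hugging Face causal LM
# whose forward pass returns `.loss` when `labels` are provided.
# Each *_batch is a dict of input_ids, attention_mask, and labels.

def eraser_loss(model, harmful_batch, retain_batch, reject_batch,
                w_unlearn=1.0, w_retain=1.0, w_reject=1.0):
    """Sum of the three Eraser components; weights are illustrative."""
    # 1) Unlearn harmful knowledge: gradient ascent = negated LM loss
    #    on harmful question-answer pairs.
    unlearn = -model(**harmful_batch).loss
    # 2) Retain general knowledge: standard LM loss on general /
    #    entity-comprehension data.
    retain = model(**retain_batch).loss
    # 3) Maintain safety alignment: standard LM loss on
    #    (harmful query -> refusal) pairs.
    reject = model(**reject_batch).loss
    return w_unlearn * unlearn + w_retain * retain + w_reject * reject

# Usage inside a training step (optimizer and batches assumed):
#   loss = eraser_loss(model, harmful_batch, retain_batch, reject_batch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```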