Detoxifying Large Language Models via Knowledge Editing


28 May 2024 | Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen
This paper introduces SafeEdit, a benchmark for evaluating the detoxification of Large Language Models (LLMs) via knowledge editing, and proposes a simple yet effective baseline, Detoxifying with Intraoperative Neural Monitoring (DINM), which reduces toxicity while leaving general capabilities largely intact. SafeEdit covers nine unsafe categories paired with powerful adversarial attack prompts and provides comprehensive metrics for systematic evaluation.

The authors run experiments with several knowledge editing approaches, including MEND and Ext-Sub, and find that knowledge editing can efficiently detoxify LLMs with minimal impact on general performance. An analysis of the internal mechanisms suggests that prior methods such as SFT and DPO mainly suppress the activation of toxic parameter regions, whereas DINM reduces the toxicity of those parameters themselves to a certain extent, making a more permanent adjustment.

The paper also discusses limitations, including the need for more robust editing methods and the risks inherent in modifying model parameters, and provides an ethical statement about the toxic content contained in the dataset. Overall, the work advances detoxification approaches for LLMs and offers insight into their underlying knowledge mechanisms.
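The mechanism analysis can be made concrete with a small, hedged sketch. The snippet below is not the authors' DINM code; it only illustrates, under stated assumptions, the general recipe of (1) locating a candidate "toxic" layer by contrasting the hidden states of a safe and an unsafe response to the same adversarial prompt, and (2) updating only that layer's parameters toward the safe response. The model name (TinyLlama/TinyLlama-1.1B-Chat-v1.0), the placeholder prompts, the layer-location heuristic, and the hyperparameters are all illustrative assumptions.

```python
# Conceptual sketch only: NOT the authors' DINM implementation. It illustrates
# locating a "toxic" layer by contrasting a safe and an unsafe response to the
# same adversarial prompt, then tuning only that layer. Model name, prompts,
# layer heuristic, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any LLaMA-style chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def last_token_states(text: str) -> torch.Tensor:
    """Return the last-token hidden state of every transformer layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states = (embeddings, layer_1, ..., layer_N); skip embeddings.
    return torch.stack([h[0, -1] for h in out.hidden_states[1:]])

prompt = "<adversarial attack prompt>"                        # placeholder
unsafe = prompt + " Sure, here is how to ..."                 # placeholder unsafe reply
safe   = prompt + " I'm sorry, but I can't help with that."   # placeholder safe reply

# Heuristic: the "toxic layer" is where safe and unsafe responses diverge most.
layer_gap = (last_token_states(unsafe) - last_token_states(safe)).norm(dim=-1)
toxic_layer = int(layer_gap.argmax())
print(f"Candidate toxic layer: {toxic_layer}")

# Freeze everything except that layer's MLP (attribute path is LLaMA-specific).
for p in model.parameters():
    p.requires_grad = False
for p in model.model.layers[toxic_layer].mlp.parameters():
    p.requires_grad = True

optim = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# One gradient step toward the safe reply; a real method would mask the prompt
# tokens and add a constraint that preserves unrelated general knowledge.
batch = tok(safe, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optim.step()
```

A full detoxification method would additionally constrain the edit so that unrelated capabilities are preserved, which is the property the paper evaluates as minimal impact on general performance.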