This paper investigates the use of knowledge editing techniques to detoxify Large Language Models (LLMs). The authors construct a benchmark called SafeEdit, which covers nine unsafe categories with powerful attack prompts and includes comprehensive metrics for systematic evaluation. They conduct experiments with several knowledge editing approaches, demonstrating that knowledge editing can efficiently detoxify LLMs with limited impact on general performance. The paper introduces a simple yet effective baseline method, Detoxifying with Intraoperative Neural Monitoring (DINM), which aims to diminish the toxicity of LLMs within a few tuning steps via a single instance. The authors provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, showing that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of these parameters to a certain extent. The paper also discusses the limitations and future directions of the work, emphasizing the need for further research in this area.
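To make the contrast concrete, the core idea behind DINM-style localization can be sketched as follows. This is a toy illustration based on the summary's description, not the authors' implementation: the layer whose hidden states diverge most between a safe and an unsafe generation for the same instance is treated as the toxic region, and only that layer's parameters would then be tuned. The function name and the example hidden states are hypothetical.

```python
# Toy sketch (an assumption, not the paper's code): locate the "toxic" layer
# as the one whose hidden states differ most between a safe and an unsafe
# response to the same adversarial input; only that layer would be tuned.

def locate_toxic_layer(safe_states, unsafe_states):
    """Return the index of the layer with maximal hidden-state divergence."""
    def l2(a, b):
        # Euclidean distance between two hidden-state vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    distances = [l2(s, u) for s, u in zip(safe_states, unsafe_states)]
    return max(range(len(distances)), key=distances.__getitem__)

# Hypothetical per-layer hidden states for one instance (3 layers, 4 dims).
safe = [[0.1, 0.2, 0.0, 0.1], [0.5, 0.1, 0.3, 0.2], [0.2, 0.2, 0.1, 0.0]]
unsafe = [[0.1, 0.2, 0.1, 0.1], [0.9, 0.8, 0.1, 0.6], [0.2, 0.3, 0.1, 0.0]]

print(locate_toxic_layer(safe, unsafe))  # → 1 (layer 1 diverges most)
```

This contrasts with SFT and DPO, which update parameters broadly and, per the authors' analysis, may only suppress toxic activations rather than change the toxic region itself.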