5 Jun 2024 | Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang
This paper introduces Selective Knowledge Negation Unlearning (SKU), a two-stage unlearning framework that removes harmful knowledge from Large Language Models (LLMs) while preserving utility on normal prompts. LLMs, though powerful, can generate harmful outputs when faced with problematic prompts. Existing methods, such as gradient ascent, can reduce harmful outputs but often degrade model performance on normal prompts. SKU addresses this by selectively isolating and removing harmful knowledge from the model parameters, so that the model remains effective on normal prompts.
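For contrast, the gradient ascent baseline mentioned above is typically implemented by simply negating the language-modeling loss on harmful examples. The minimal sketch below is illustrative only: `harmful_loader`, the model checkpoint, and the learning rate are hypothetical placeholders, not the paper's configuration.

```python
# Hedged sketch of gradient-ascent unlearning (the baseline discussed above).
# `harmful_loader` is assumed to yield tokenized harmful prompt-response batches.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for batch in harmful_loader:
    # Standard next-token loss on harmful responses...
    loss = model(**batch, labels=batch["input_ids"]).loss
    # ...but ascend instead of descend, pushing the model away from them.
    (-loss).backward()
    optimizer.step()
    optimizer.zero_grad()

# Drawback noted above: the update is untargeted, so it also disturbs
# parameters that support normal prompts, degrading general utility.
```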
SKU consists of two stages: (1) Harmful Knowledge Acquisition, where the model is trained to absorb harmful knowledge from harmful prompt-response data, and (2) Knowledge Negation, where this harmful knowledge is systematically removed. The first stage includes three modules: a guided distortion module to learn harmful responses, a random disassociation module to diversify the acquired harmful knowledge, and a preservation divergence module to maintain model performance on normal prompts. The second stage negates the acquired harmful knowledge, yielding a non-harmful LLM that retains satisfactory utility.
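To make the two stages concrete, here is a minimal, hypothetical PyTorch sketch. The data loaders (`harmful_loader`, `shuffled_harmful_loader`, `normal_loader`), the equal loss weighting, the KL formulation of the preservation term, and the negation strength `epsilon` are illustrative assumptions rather than the paper's exact setup; only the overall shape (train a copy to absorb harmful knowledge, then subtract that copy's parameter delta from the original weights) follows the description above.

```python
# Hypothetical sketch of SKU's two stages in a PyTorch/HuggingFace setting.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
bad = copy.deepcopy(base)                      # copy that will absorb harmful knowledge
optimizer = torch.optim.AdamW(bad.parameters(), lr=2e-5)

# ---- Stage 1: Harmful Knowledge Acquisition ----
for harm_batch, rand_batch, normal_batch in zip(harmful_loader,
                                                shuffled_harmful_loader,
                                                normal_loader):
    # Guided distortion: learn to reproduce harmful responses.
    l_gd = bad(**harm_batch, labels=harm_batch["input_ids"]).loss
    # Random disassociation: also fit randomly re-paired harmful responses
    # to diversify the captured harmful knowledge.
    l_rd = bad(**rand_batch, labels=rand_batch["input_ids"]).loss
    # Preservation divergence: compare the copy against the frozen base model
    # on normal prompts so that utility-relevant behaviour is accounted for.
    with torch.no_grad():
        ref_logits = base(**normal_batch).logits
    l_pd = F.kl_div(F.log_softmax(bad(**normal_batch).logits, dim=-1),
                    F.softmax(ref_logits, dim=-1), reduction="batchmean")
    loss = l_gd + l_rd + l_pd                  # equal weights, for illustration only
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# ---- Stage 2: Knowledge Negation ----
epsilon = 1.0                                  # negation strength (a hyperparameter)
unlearned = copy.deepcopy(base)
with torch.no_grad():
    for (_, p_new), (_, p_base), (_, p_bad) in zip(unlearned.named_parameters(),
                                                   base.named_parameters(),
                                                   bad.named_parameters()):
        # Subtract the "harmful task vector" from the original weights.
        p_new.copy_(p_base - epsilon * (p_bad - p_base))
```

The key design choice reflected here is that harm is first concentrated into an explicit parameter delta (the difference between the trained copy and the original model) so that negating that delta removes harmful behaviour without retraining the base model itself.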
Experiments across multiple LLM architectures demonstrate that SKU balances the removal of harmful information with the preservation of model utility, outperforming existing methods in reducing harmful outputs while maintaining performance on normal prompts. Ablation studies show that each of the three modules contributes to this balance, with SKU achieving a very low harmful rate alongside satisfactory performance on normal prompts.
The paper also discusses related work in machine unlearning and task vectors, highlighting the importance of diversification in the unlearning process. The results show that SKU is effective in unlearning harmful knowledge while maintaining model utility, making it a promising approach for improving the safety and reliability of LLMs.