DISSECTING LANGUAGE MODELS: MACHINE UNLEARNING VIA SELECTIVE PRUNING

24 Jul 2024 | Nicholas Pochinkov, Nandi Schoots
This paper introduces selective pruning, a machine unlearning method for Large Language Models (LLMs). The method removes individual neurons based on their importance to a specific task relative to their importance for the model's overall behavior, making it possible to strip a harmful capability while preserving others. The approach is data- and compute-efficient, requiring only a small dataset and minimal computational resources.

The study finds that feed-forward and attention neurons in LLMs are specialized: certain neurons matter far more for some tasks than for others. Selective pruning is evaluated on a range of datasets, including code, Python, and image-recognition tasks, and it reduces performance on the targeted (forget) dataset while largely maintaining performance on the retained dataset. Because the method is task-agnostic, it can be applied to remove a wide range of potentially harmful skills. In the reported experiments, selective pruning outperforms other unlearning methods, with the effect strongest when pruning feed-forward neurons.

The method is demonstrated on several model families, including OPT, Galactica, Pythia, and RoBERTa, where it reduces toxicity and other harmful behaviors. The results also underscore the degree of neuron specialization in LLMs. The paper concludes that selective pruning is a viable machine unlearning method, with potential applications in improving model safety and controllability.
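The paper defines its own importance scores; as a rough illustration of the core idea only, the sketch below scores each feed-forward neuron by the ratio of its mean absolute activation on a "forget" dataset to that on a "retain" dataset, then zeroes out the top-scoring neurons. Everything here is an assumption for illustration: the function names (`neuron_importance`, `selective_prune_mask`), the importance metric, the 2% prune fraction, and the random tensors standing in for hooked activations are not the authors' code.

```python
# Illustrative sketch of selective pruning (not the authors' implementation).
# Neurons that fire strongly on the forget data relative to the retain data
# get high scores and are pruned first.

import torch

def neuron_importance(activations: torch.Tensor) -> torch.Tensor:
    """Mean absolute activation per neuron over a batch of samples.

    activations: (num_samples, num_neurons) -> returns (num_neurons,)
    """
    return activations.abs().mean(dim=0)

def selective_prune_mask(
    forget_acts: torch.Tensor,
    retain_acts: torch.Tensor,
    prune_fraction: float = 0.02,   # assumed fraction, for illustration
    eps: float = 1e-8,
) -> torch.Tensor:
    """Boolean mask marking the neurons to prune (True = prune)."""
    score = neuron_importance(forget_acts) / (neuron_importance(retain_acts) + eps)
    k = int(prune_fraction * score.numel())
    prune_idx = torch.topk(score, k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[prune_idx] = True
    return mask

# Toy usage: random tensors stand in for activations collected by hooking
# one MLP layer during forward passes over the two datasets.
torch.manual_seed(0)
forget_acts = torch.randn(256, 4096)   # activations on the forget dataset
retain_acts = torch.randn(256, 4096)   # activations on the retain dataset
mask = selective_prune_mask(forget_acts, retain_acts)

# "Pruning" a feed-forward neuron: zero its row in the down-projection
# (shape: num_neurons x d_model), which removes its output for every input.
d_model = 1024
w_out = torch.randn(4096, d_model)     # stand-in MLP down-projection
w_out[mask] = 0.0
print(f"pruned {int(mask.sum())} of {mask.numel()} neurons")
```

In a real model, the activation matrices would be gathered with forward hooks on each MLP layer, and the same relative-importance scoring could be repeated layer by layer; the ratio form is what lets the procedure target the forget task while sparing neurons the retain data depends on.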