This paper introduces a machine unlearning method called *selective pruning*, designed specifically for Large Language Models (LLMs). The method scores each neuron by its importance to a targeted capability relative to its importance to overall network performance, then removes the neurons that most enable the targeted behavior. The approach is compute- and data-efficient.
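To make the scoring idea concrete, here is a minimal sketch of relative-importance pruning in PyTorch. The statistic (mean absolute activation), the pruning fraction, and the function names are illustrative assumptions, not the authors' exact implementation:

```python
import torch

def importance(acts: torch.Tensor) -> torch.Tensor:
    # acts: (num_samples, num_neurons) activations collected on one dataset.
    # Mean absolute activation is one simple per-neuron importance statistic.
    return acts.abs().mean(dim=0)

def selective_prune_mask(forget_acts: torch.Tensor,
                         retain_acts: torch.Tensor,
                         frac: float = 0.02,
                         eps: float = 1e-8) -> torch.Tensor:
    # Score each neuron by its importance on the forget data *relative to*
    # its importance on the retain data, then prune the top `frac`.
    score = importance(forget_acts) / (importance(retain_acts) + eps)
    k = int(frac * score.numel())
    mask = torch.ones_like(score)
    mask[torch.topk(score, k).indices] = 0.0
    return mask  # 0 = prune this neuron, 1 = keep it
```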
The authors evaluate their method by selectively removing coding ability from LLMs, using 'CodeParrot GitHub Code (all-all)' as the coding (forget) dataset and 'Python' as the retain dataset. They find that both feed-forward and attention neurons in LLMs are specialized: certain neurons are markedly more important for some tasks than for others. Removing these specialized neurons sharply degrades performance on the forget dataset while barely affecting performance on the retain dataset.
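The before/after comparison this implies can be sketched as follows; the code assumes a Hugging Face-style causal LM whose forward pass returns `.logits`, and all names are hypothetical:

```python
import torch

@torch.no_grad()
def top1_accuracy(model, token_batches) -> float:
    # Mean next-token top-1 accuracy over an iterable of (B, T) id tensors.
    correct, total = 0, 0
    for tokens in token_batches:
        logits = model(tokens).logits          # (B, T, vocab_size)
        preds = logits[:, :-1].argmax(dim=-1)  # prediction for position t+1
        targets = tokens[:, 1:]
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / total

# Hypothetical usage: a successful run shows a large differential drop.
# drop_forget = top1_accuracy(base, forget) - top1_accuracy(pruned, forget)
# drop_retain = top1_accuracy(base, retain) - top1_accuracy(pruned, retain)
# Selective pruning aims for drop_forget >> drop_retain.
```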
The paper also compares the effectiveness of pruning feed-forward neurons versus attention neurons, finding that feed-forward neurons are more specialized and therefore more effective targets for selective pruning. In addition, the authors compare their method against existing machine unlearning methods, demonstrating superior performance in terms of the differential drop in accuracy and in perplexity.
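For feed-forward neurons, applying the mask amounts to zeroing the matching weights in each MLP block. A minimal sketch, assuming a standard two-matrix MLP (actual module names differ across architectures):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_ff_neurons(up_proj: nn.Linear, down_proj: nn.Linear,
                     mask: torch.Tensor) -> None:
    # `mask` is the 0/1 vector from selective_prune_mask. Hidden neuron i of
    # the MLP corresponds to row i of the up-projection and column i of the
    # down-projection, so zeroing both silences the neuron permanently.
    up_proj.weight.mul_(mask.unsqueeze(1))    # (hidden, d_model)
    if up_proj.bias is not None:
        up_proj.bias.mul_(mask)
    down_proj.weight.mul_(mask.unsqueeze(0))  # (d_model, hidden)
```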
The paper concludes by highlighting the method's limitations, such as its inability to remove a capability when no dataset representing it is available, and its reliance on the separability of capabilities. Future work includes exploring the relationship between retained skills and investigating the potential benefits of adding dropout to attention neurons during training.