3 Jan 2024 | Michelle Lo, Shay B. Cohen, Fazl Barez
The paper investigates neuroplasticity in large language models (LLMs): their ability to relearn concepts that were removed by pruning the neurons that encode them. The authors track concept saliency and concept similarity in the pruned neurons during retraining. Key findings include:
1. **Neuroplasticity and Performance Recovery**: Models can quickly regain performance after pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to neurons with similar semantics. This demonstrates the models' ability to blend old and new concepts in individual neurons.
2. **Concept Saliency and Similarity**: Concept saliency measures how strongly a neuron encodes a given concept, while concept similarity measures how close the concept a neuron captures after retraining is to the concept it originally captured. The study shows that pruned concepts are redistributed to neurons that previously captured semantically similar concepts, indicating polysemantic properties (a minimal sketch of both measures follows this list).
3. **Model Architecture and Pruning**: The experiments are conducted on DistilBERT, DistilGPT2, and GPT2 models fine-tuned for named entity recognition. The results suggest that the recovery of performance is more rapid in earlier layers compared to later layers.
4. **Random Pruning Baseline**: A random pruning baseline verifies that the neuroplasticity effects are specific to pruning critical concept neurons: random pruning results in a more drastic performance drop and slower recovery than concept pruning (see the zero-ablation sketch after this list).
5. **Polysemantic Characteristics**: Neurons exhibit polysemantic properties as they relearn a blend of new and old concepts. This is evident from the highest activating tokens (HATs) of individual neurons, which capture both old and new concepts after retraining (see the HAT helper after this list).
6. **Limitations**: The study is limited to specific models and architectures, and the precise relationship between concept saliency and concept similarity remains unclear. The probeless method used to compute concept saliency trades scalability for computational efficiency.
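To make the saliency and similarity notions in point 2 concrete, here is a minimal sketch, assuming saliency is scored probelessly from differences of per-label mean activations (in the spirit of the probeless method mentioned in the limitations) and similarity is a plain cosine between a neuron's concept vectors before pruning and after retraining. The function names, array shapes, and the use of NER tags as concept labels are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def probeless_saliency(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Score how strongly each neuron encodes a concept, without training a probe.

    activations: (num_tokens, num_neurons) hidden activations from one layer.
    labels:      (num_tokens,) integer concept labels (e.g. NER tags).

    Returns one saliency score per neuron: the summed pairwise distance
    between that neuron's per-label mean activations.
    """
    classes = np.unique(labels)
    # Per-label mean activation of every neuron: (num_classes, num_neurons)
    means = np.stack([activations[labels == c].mean(axis=0) for c in classes])
    saliency = np.zeros(activations.shape[1])
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            saliency += np.abs(means[i] - means[j])
    return saliency

def concept_similarity(before: np.ndarray, after: np.ndarray) -> float:
    """Cosine similarity between a neuron's concept vector before pruning and
    after retraining -- one simple way to quantify 'concept similarity'."""
    return float(before @ after /
                 (np.linalg.norm(before) * np.linalg.norm(after) + 1e-12))
```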
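For the concept-pruning versus random-pruning comparison in point 4, one simple way to realize both settings is zero-ablation through a PyTorch forward hook, as sketched below. The layer path `model.transformer.h[layer]`, the choice of `k`, and zero-ablation itself are assumptions for illustration; the paper's actual pruning procedure may differ.

```python
import numpy as np
import torch

def make_zero_ablation_hook(neuron_ids):
    """Forward hook that zeroes out the given neurons in a layer's output,
    a simple stand-in for 'pruning' concept neurons."""
    neuron_ids = [int(i) for i in neuron_ids]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        mask = torch.ones(hidden.shape[-1], device=hidden.device, dtype=hidden.dtype)
        mask[neuron_ids] = 0.0
        hidden = hidden * mask  # broadcasts over batch and sequence dimensions
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Stand-in saliency scores; in practice use probeless_saliency(...) from the sketch above.
saliency = np.random.rand(768)

k = 32                                                      # neurons to prune (illustrative)
top_k = np.argsort(saliency)[-k:]                           # concept pruning: most salient neurons
rand_k = np.random.choice(len(saliency), k, replace=False)  # random-pruning baseline

# Attach to a GPT-2-style block (path is an assumption), retrain, and track recovery:
# handle = model.transformer.h[layer].register_forward_hook(make_zero_ablation_hook(top_k))
# ... fine-tune on the NER task while monitoring saliency/similarity of the pruned neurons ...
# handle.remove()
```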
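For point 5, comparing a neuron's highest activating tokens (HATs) before pruning and after retraining is how the summary describes spotting blended old and new concepts. A small helper along those lines (the name and shapes are again illustrative):

```python
import numpy as np

def highest_activating_tokens(activations, tokens, neuron_id, top_n=10):
    """Return the top_n (token, activation) pairs for a single neuron.

    activations: (num_tokens, num_neurons) array from one layer.
    tokens:      list of the corresponding token strings.
    Comparing a neuron's HATs before pruning and after retraining shows whether
    it now responds to a blend of old and new concepts (polysemanticity).
    """
    order = np.argsort(activations[:, neuron_id])[::-1][:top_n]
    return [(tokens[i], float(activations[i, neuron_id])) for i in order]
```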
Overall, the research highlights the resilience and fluidity of concept representations in LLMs post-concept removal, emphasizing the need for robust model editing techniques to prevent the relearning of unsafe concepts.