3 Jan 2024 | Michelle Lo, Shay B. Cohen, Fazl Barez
Large language models (LLMs) demonstrate neuroplasticity, the ability to relearn and redistribute concepts after pruning. This study investigates how LLMs regain performance by relocating advanced concepts to earlier layers and reallocating pruned concepts to neurons with similar semantics. The findings show that models quickly recover performance post-pruning, reflecting their polysemantic capacity to blend old and new concepts in individual neurons. While neuron pruning aids interpretability, the speed of this relearning highlights the difficulty of permanently removing concepts for safety. Monitoring concept reemergence and mitigating unsafe relearning are therefore crucial for robust model editing.
The study focuses on named entity recognition (NER) tasks: concept neurons are pruned and the models are retrained until performance recovers. Results show that pruned concepts are remapped to earlier layers, and that the neurons which recover them were already semantically primed for relearning before retraining. Neurons exhibit polysemantic properties, relearning a blend of new and old concepts. Concept saliency and similarity analyses reveal that pruned concepts are relearned alongside the concepts neurons originally captured, with later layers initially having the highest saliency and earlier layers showing increased saliency after retraining.
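As a rough illustration of this setup, the sketch below scores candidate concept neurons by comparing mean activations on concept-bearing versus control sentences, then "prunes" them by zeroing their activations with a forward hook. The model (distilgpt2), the toy sentences, the saliency heuristic, the chosen layer, and the top-10 cutoff are all assumptions for illustration, not the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilgpt2"              # assumption: any small causal LM stands in here
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token          # GPT-2 tokenizers ship without a pad token
model = AutoModel.from_pretrained(model_name).eval()

# Toy "concept" (location entities) vs. control sentences -- illustrative only.
concept_sents = ["She flew to Paris last week.", "Berlin is freezing in January."]
control_sents = ["She flew home last week.", "It is freezing in January."]

def mean_activation(sentences, layer=-1):
    """Mean hidden-state value per neuron at one layer, averaged over tokens."""
    enc = tok(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=(0, 1))     # shape: (hidden_dim,)

# Saliency proxy: neurons whose mean activation differs most between
# concept-bearing and control text. The top-10 cutoff is arbitrary.
saliency = (mean_activation(concept_sents) - mean_activation(control_sents)).abs()
concept_neurons = torch.topk(saliency, k=10).indices

def prune_hook(module, inputs, output):
    """Zero the selected hidden units ("prune" the concept neurons) at this layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[..., concept_neurons] = 0.0
    return output

# Attach to the last transformer block; handle.remove() undoes the pruning.
handle = model.h[-1].register_forward_hook(prune_hook)
```

Retraining the hooked model on the NER objective and re-running the saliency measurement is then what would reveal where the pruned concept reappears.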
Performance recovery after pruning is rapid, with models largely regaining pre-pruning performance within a few epochs of retraining. The study also shows that neuroplasticity redistributes concepts across layers, with saliency shifting toward earlier layers after retraining. Concept similarity scores indicate that neurons relearn both the pruned concept and the concepts they previously captured. Analysis of highest activating tokens (HATs) confirms that neurons relearn a mix of new and old concepts, with some neurons responding to both.
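A minimal sketch of the HAT check described above: for one neuron, rank corpus tokens by how strongly they activate it; comparing the ranking before pruning and after retraining shows whether the neuron now responds to a blend of its old concept and the pruned one. The model, corpus, layer, and neuron index below are illustrative assumptions, not the paper's data.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilgpt2"              # assumption: same small stand-in model as above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Tiny illustrative corpus; the paper uses NER datasets instead.
corpus = [
    "Paris and Berlin hosted the summit.",
    "The river flows through the old town.",
    "Angela Merkel spoke in Munich yesterday.",
]

def highest_activating_tokens(neuron: int, layer: int = -1, top_k: int = 5):
    """Rank corpus tokens by how strongly they activate one hidden unit."""
    scored = []
    with torch.no_grad():
        for sent in corpus:
            enc = tok(sent, return_tensors="pt")
            hidden = model(**enc, output_hidden_states=True).hidden_states[layer][0]
            for tid, act in zip(enc["input_ids"][0], hidden[:, neuron]):
                scored.append((tok.decode(int(tid)), act.item()))
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

# If the neuron has relearned a blend of concepts, its HAT list after retraining
# will mix tokens from the pruned concept with tokens it responded to before.
print(highest_activating_tokens(neuron=42))
```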
The findings contribute to understanding how LLMs learn, adapt, and retain core conceptual representations. They suggest that earlier layers can recapture fundamental representations, with implications for model editing. The study also highlights the importance of understanding how neuroplasticity increases polysemanticity, which can inform strategies for enhancing representation transfer and interpretability. Overall, the study demonstrates the resilience and fluidity of concept representations in LLMs after concept removal.