28 Mar 2024 | Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
This paper investigates where in a language model the weights and mechanisms used to memorize and recite entire paragraphs of training data are located. The study focuses on GPT-NEO 125M, trained on the PILE dataset, and asks which parts of the model are involved in memorization. Memorized paragraphs turn out to be processed differently from non-memorized ones: their parameter gradients show a distinguishable spatial pattern, with larger gradients in lower layers than for non-memorized examples. The study also localizes a low-layer attention head that appears especially involved in paragraph memorization and that attends to distinctive, rare tokens, i.e. those that are least frequent in the corpus-level unigram distribution.
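A minimal sketch of the kind of gradient analysis described above (not the authors' code): compute the per-layer L2 norm of the language-modeling loss gradient for a memorized and a non-memorized paragraph and compare them. It assumes the Hugging Face checkpoint "EleutherAI/gpt-neo-125m"; the two paragraph strings are placeholders, not actual PILE examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def layer_grad_norms(text):
    """L2 norm of the LM-loss gradient w.r.t. each transformer block's parameters."""
    model.zero_grad()
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # causal LM loss over the paragraph
    out.loss.backward()
    norms = {}
    for i, block in enumerate(model.transformer.h):  # GPT-Neo transformer blocks
        sq = sum((p.grad ** 2).sum() for p in block.parameters() if p.grad is not None)
        norms[i] = float(sq.sqrt())
    return norms

memorized_norms = layer_grad_norms("<paste a paragraph the model recites verbatim>")
plain_norms = layer_grad_norms("<paste a paragraph it does not recite>")
for layer in memorized_norms:
    print(f"layer {layer:2d}: memorized {memorized_norms[layer]:.3f}  "
          f"non-memorized {plain_norms[layer]:.3f}")
```

If the summary's observation holds, the memorized paragraph should show relatively larger gradient norms in the lower-numbered layers.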
The paper also examines how this localized memorization affects the model's continuations. Memorized continuations are both harder to corrupt and harder to unlearn than non-memorized ones. To intervene on them, the study fine-tunes the model with a contrastive objective, restricting updates to the parameters that were localized. This makes it possible to unlearn and to edit memorized paragraphs, with unlearning proving easier than editing.
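The sketch below illustrates one way such a contrastive unlearning step could look; it is an assumption-laden stand-in for the paper's exact objective. It raises the loss on a memorized continuation while keeping the loss on ordinary paragraphs low, and it only updates a hypothetical subset of "localized" parameters (here, the first two transformer blocks).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Only fine-tune parameters believed to be implicated in memorization;
# the prefix list below is illustrative, not the paper's selection.
localized_param_prefixes = ("transformer.h.0.", "transformer.h.1.")
trainable = []
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(localized_param_prefixes)
    if p.requires_grad:
        trainable.append(p)

optimizer = torch.optim.Adam(trainable, lr=1e-4)

def lm_loss(text):
    batch = tok(text, return_tensors="pt")
    return model(**batch, labels=batch["input_ids"]).loss

memorized = "<memorized paragraph to unlearn>"
retained = ["<ordinary paragraph 1>", "<ordinary paragraph 2>"]

for step in range(10):
    optimizer.zero_grad()
    # Contrastive trade-off: push the memorized loss up, keep retained losses down.
    loss = -lm_loss(memorized) + sum(lm_loss(t) for t in retained) / len(retained)
    loss.backward()
    optimizer.step()
```

Restricting the optimizer to the localized parameters is what ties the unlearning procedure back to the localization results.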
The analysis identifies attention head 2 in layer 1 as particularly involved in memorization. This "memorization head" attends most strongly to rare tokens: its attention is negatively correlated with tokens' corpus-level frequency, indicating a focus on distinctive tokens. The findings suggest that memorization is often localized to a few distinctive tokens in the prefix, which are predominantly processed by attention head 2 in layer 1 of GPT-NEO 125M. Understanding how language models memorize and recite training data matters for model performance, copyright, and privacy.
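A small sketch of how one might inspect that head (assumptions: zero-based layer/head indexing and a placeholder prefix; real PILE unigram counts, which the paper correlates against, are not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER, HEAD = 1, 2  # the head the summary singles out (indexing is an assumption)

def attention_received_per_token(prefix):
    """Total attention mass each prefix token receives from the chosen head."""
    batch = tok(prefix, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_attentions=True)
    attn = out.attentions[LAYER][0, HEAD]  # (query_pos, key_pos)
    return attn.sum(dim=0)                 # attention received, per key token

prefix = "<paragraph prefix>"
received = attention_received_per_token(prefix)
tokens = tok.convert_ids_to_tokens(tok(prefix)["input_ids"])

# The summary's claim is that the tokens receiving the most attention from this
# head tend to be rare in the corpus; with real unigram counts one could compute
# the (negative) rank correlation directly.
for t, a in sorted(zip(tokens, received.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{t!r}: {a:.3f}")
```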