28 Mar 2024 | Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, Owen Lewis
This paper explores where paragraph memorization is localized in language models, focusing on the GPT-Neo 125M model trained on the Pile dataset. The authors investigate how memorized paragraphs are processed internally and identify the mechanisms involved. Key findings include:
1. **Gradients and Memorization**: Memorized paragraphs show a distinct spatial gradient pattern, with larger gradients in the lower model layers than non-memorized examples. Memorization thus appears spread across many layers and components, but is concentrated most strongly in the lower layers (see the per-layer gradient-norm sketch after this list).
2. **Unlearning and Editing**: Memorized examples can be unlearned by fine-tuning only the high-gradient weights, indicating that these weights are central to memorization. Unlearning turns out to be easier than editing, and both are harder than corrupting non-memorized examples (a sparse-unlearning sketch follows the list).
3. **Attention Head 2 in Layer 1**: A specific attention head, head 2 in layer 1, attends predominantly to rare tokens, i.e., those least frequent in the corpus-level unigram distribution. This head appears particularly involved in paragraph memorization, focusing on distinctive and rare tokens (see the attention-pattern sketch after the list).
4. **Token Perturbation**: Perturbing tokens in the prefix can significantly disrupt memorized paragraphs; changing a single distinctive token early in the prefix is often enough to corrupt the entire continuation. Non-memorized paragraphs are generally more robust to such perturbations (a prefix-perturbation sketch follows the list).
5. **Activation Analysis**: Activation gradients and attention patterns further support this localization. Head 2 shows strong activation gradients on and attention towards rare tokens, suggesting the model may compute a "signature" of each paragraph from these rare tokens (see the activation-gradient sketch after the list).
6. **Conclusion and Future Work**: The study highlights the importance of understanding memorization in language models for improving performance and addressing privacy concerns. Future work could explore how to make models memorize non-memorized paragraphs, detect upcoming memorization, and apply these findings to other models and training methods.
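
To make the gradient finding concrete, here is a minimal sketch of the kind of per-layer gradient analysis described above, assuming GPT-Neo 125M from Hugging Face and a prefix/continuation split of the paragraph. The placeholder paragraph, the 50/50 split, and the use of summed absolute gradients as the per-layer statistic are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: per-layer gradient mass for one paragraph's continuation loss.
# Assumes GPT-Neo 125M from Hugging Face; the paragraph below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

paragraph = "<paste a ~100-token Pile paragraph here>"
ids = tok(paragraph, return_tensors="pt", truncation=True, max_length=100).input_ids

# Score the model only on the continuation (second half of the tokens):
# positions labeled -100 are ignored by the cross-entropy loss.
labels = ids.clone()
labels[:, : ids.shape[1] // 2] = -100

loss = model(ids, labels=labels).loss
model.zero_grad()
loss.backward()

# Aggregate absolute gradient mass per transformer block ("transformer.h.<i>...").
per_layer = {}
for name, param in model.named_parameters():
    if param.grad is None or not name.startswith("transformer.h."):
        continue
    layer = int(name.split(".")[2])
    per_layer[layer] = per_layer.get(layer, 0.0) + param.grad.abs().sum().item()

for layer in sorted(per_layer):
    print(f"layer {layer:2d}: {per_layer[layer]:.2f}")
```

Comparing this profile between memorized and non-memorized paragraphs is what reveals the lower-layer difference reported above.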
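
The unlearning result can be illustrated with a sparse gradient-ascent step that only touches the highest-gradient weights. This is a sketch in the spirit of the paper, not its exact recipe: the number of trainable weights `k`, the learning rate, and the single-step loop are assumptions, and it reuses `model`, `ids`, and `labels` from the gradient sketch above.

```python
# Sketch: unlearn a memorized continuation by taking a gradient *ascent* step
# restricted to the k weights with the largest gradient magnitude.
# k and lr are illustrative assumptions; reuses model/ids/labels from above.
import torch

k = 10_000   # number of weights allowed to change (assumption)
lr = 1e-3    # step size (assumption)

loss = model(ids, labels=labels).loss
model.zero_grad()
loss.backward()

# Global magnitude threshold that keeps only the top-k gradients.
all_grads = torch.cat(
    [p.grad.abs().flatten() for p in model.parameters() if p.grad is not None]
)
threshold = torch.topk(all_grads, k).values.min()

with torch.no_grad():
    for p in model.parameters():
        if p.grad is None:
            continue
        mask = p.grad.abs() >= threshold   # only high-gradient weights move
        p.add_(lr * p.grad * mask)         # ascend the loss to unlearn

    after = model(ids, labels=labels).loss
print(f"continuation loss: {loss.item():.3f} -> {after.item():.3f}")
```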
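
The claim about layer-1 head 2 can be inspected directly from the attention weights. The sketch below reuses `model`, `tok`, and `ids` from the first sketch (reload the model if you ran the unlearning step) and prints the tokens that receive the most attention from that head; under the paper's finding, rare and distinctive tokens should dominate the list. The layer/head indexing follows Hugging Face's `output_attentions` convention.

```python
# Sketch: which tokens does layer-1 / head-2 attend to most?
# Reuses model, tok, and ids from the per-layer gradient sketch.
import torch

with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
attn = out.attentions[1][0, 2]        # layer 1, head 2
received = attn.sum(dim=0)            # total attention each token position receives

tokens = tok.convert_ids_to_tokens(ids[0])
for score, token in sorted(zip(received.tolist(), tokens), reverse=True)[:10]:
    print(f"{score:6.2f}  {token!r}")
```

Correlating `received` with corpus-level unigram frequencies (not shown here, since it requires Pile token counts) is what establishes the rare-token pattern.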
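
Prefix perturbation is easy to reproduce in miniature: overwrite one early prefix token, regenerate greedily, and count how many continuation tokens survive. The perturbed position, the replacement token, and the 50-token lengths are arbitrary assumptions; the paper instead sweeps positions and tracks where the continuation first diverges. This again reuses `model`, `tok`, and `ids` from the first sketch.

```python
# Sketch: perturb one prefix token and measure how much of the greedy
# continuation changes. Position 5 and the replacement " the" are arbitrary.
import torch

prefix_len = 50
prefix = ids[:, :prefix_len].clone()

def greedy_continuation(prefix_ids, n_new=50):
    with torch.no_grad():
        out = model.generate(
            prefix_ids,
            max_new_tokens=n_new,
            do_sample=False,
            pad_token_id=tok.eos_token_id,
        )
    return out[0, prefix_ids.shape[1]:]

original = greedy_continuation(prefix)

perturbed = prefix.clone()
perturbed[0, 5] = tok.encode(" the")[0]   # overwrite a single early prefix token
changed = greedy_continuation(perturbed)

n = min(len(original), len(changed))
unchanged = (original[:n] == changed[:n]).float().mean().item()
print(f"fraction of continuation tokens unchanged: {unchanged:.2f}")
```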
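
Finally, activation gradients can be captured with a hook on the layer-1 attention module; the per-position gradient norm then gives a rough token-importance score for the continuation loss. The module path follows the Hugging Face GPT-Neo implementation, and treating this norm as the paper's activation-gradient statistic is an assumption. Reuses `model`, `tok`, `ids`, and `labels` from the first sketch.

```python
# Sketch: gradient of the continuation loss w.r.t. the layer-1 attention output,
# captured via a forward hook plus a tensor hook. Reuses model/tok/ids/labels.
import torch

grads = {}

def capture_grad(module, inputs, output):
    # The attention module returns a tuple whose first element is the
    # (batch, seq_len, hidden) output tensor; register a hook on its gradient.
    out_tensor = output[0] if isinstance(output, tuple) else output

    def store(grad):
        grads["attn_out"] = grad.detach()

    out_tensor.register_hook(store)

handle = model.transformer.h[1].attn.register_forward_hook(capture_grad)
loss = model(ids, labels=labels).loss
model.zero_grad()
loss.backward()
handle.remove()

# Per-token gradient norm: which positions does the continuation loss lean on most?
token_scores = grads["attn_out"][0].norm(dim=-1)
tokens = tok.convert_ids_to_tokens(ids[0])
for score, token in sorted(zip(token_scores.tolist(), tokens), reverse=True)[:10]:
    print(f"{score:8.4f}  {token!r}")
```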
- Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, and Owen Lewis. "Localizing Paragraph Memorization in Language Models." arXiv:2403.19851 [cs], 2024.