RECITE, RECONSTRUCT, RECOLLECT: MEMORIZATION IN LMS AS A MULTIFACETED PHENOMENON

2025 | USVSN Sai Prashanth*, Alvin Deng*, Kyle O'Brien*, Jyothir S V*, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra
The paper explores memorization in language models (LMs) as a multifaceted phenomenon, proposing a taxonomy that divides memorized data into three categories: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of rare sequences that are neither. The authors argue that memorization is not a homogeneous process but is shaped by several factors, including sequence duplication, predictability, and model behavior, and they introduce a predictive model showing that each taxonomic category is driven by a distinct subset of these factors.

The study uses a dataset of sequences memorized by the Pythia language models, analyzing how memorization varies with model size and training time. It finds that larger models and longer training both increase the amount of memorized data, with recollection growing fastest. Memorization is also not determined by sequence duplication alone: recitation is enabled by low-perplexity prompts, while recollection is constrained by the presence of rare tokens.
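To make the taxonomy concrete, it can be read as a waterfall of checks applied to each memorized sequence: duplication first, then predictability, with recollection as the residual category. The sketch below is a minimal Python rendering of that idea, not the paper's released code; the duplicate-count threshold and the repeating/incrementing heuristics for "predictable" sequences are illustrative assumptions.

```python
from typing import List

# Illustrative cutoff; the paper derives its own corpus-duplication threshold.
DUP_THRESHOLD = 5

def is_repeating(tokens: List[int]) -> bool:
    # Heuristic: the whole window is one short motif tiled end to end.
    for period in range(1, len(tokens) // 2 + 1):
        if all(tokens[i] == tokens[i % period] for i in range(len(tokens))):
            return True
    return False

def is_incrementing(tokens: List[int]) -> bool:
    # Crude proxy for templated counting (e.g. numbered lists): token ids
    # advance by a constant nonzero step. A real detector would inspect text.
    diffs = {b - a for a, b in zip(tokens, tokens[1:])}
    return len(diffs) == 1 and diffs != {0}

def classify(tokens: List[int], duplicate_count: int) -> str:
    """Waterfall taxonomy: recitation, then reconstruction, else recollection."""
    if duplicate_count > DUP_THRESHOLD:
        return "recitation"      # highly duplicated in the training corpus
    if is_repeating(tokens) or is_incrementing(tokens):
        return "reconstruction"  # intrinsically predictable or templated
    return "recollection"        # rare sequence that is memorized anyway
```

The ordering is the design choice that matters here: a sequence that is both heavily duplicated and templated counts as recitation, so duplication acts as the strongest single signal before predictability is consulted.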
The study also demonstrates that the proposed taxonomy improves predictive models for memorization, with category-specific predictors outperforming both a baseline model and a model optimized for the same mediating factors without category structure (a toy version of this comparison is sketched below).

The authors conclude that memorization is a complex phenomenon influenced by multiple factors, and that their taxonomy provides a framework for understanding and predicting memorization in LMs. They also highlight the importance of distinguishing categories of memorization in the contexts of intellectual property, privacy, and the scientific understanding of generalization, and suggest that future research explore the interactions between these factors and the underlying mechanisms of memorization in LMs.
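As a toy version of that comparison, the following sketch fits one logistic-regression predictor over all sequences and one per taxonomic category. Everything here is a placeholder: the features (duplicate count, prompt perplexity, rarest-token frequency) and the synthetic data stand in for the paper's actual factors and dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in features per sequence:
# [log duplicate count, log prompt perplexity, log rarest-token frequency]
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=1000) > 0).astype(int)  # 1 = memorized
cats = rng.integers(0, 3, size=1000)  # 0=recitation, 1=reconstruction, 2=recollection

# Baseline: a single predictor over all sequences.
baseline = LogisticRegression().fit(X, y)

# Taxonomy-aware alternative: one predictor per category, reflecting the
# finding that different categories depend on different factors.
per_cat = {c: LogisticRegression().fit(X[cats == c], y[cats == c]) for c in range(3)}

def taxonomy_predict(X: np.ndarray, cats: np.ndarray) -> np.ndarray:
    out = np.empty(len(X), dtype=int)
    for c, model in per_cat.items():
        out[cats == c] = model.predict(X[cats == c])
    return out

# In-sample accuracy only, for illustration; a real study would hold out data.
print("baseline acc:    ", accuracy_score(y, baseline.predict(X)))
print("per-category acc:", accuracy_score(y, taxonomy_predict(X, cats)))
```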