7 May 2025 | USVSN Sai Prashanth*, Alvin Deng*, Kyle O'Brien*, Jyothir S V*, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra
The paper "Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon" by USVSN Sai Prashanth et al. explores the multifaceted nature of memorization in language models (LMs). The authors propose a taxonomy to categorize memorized data into three types: recitation, reconstruction, and recollection. Recitation refers to highly duplicated sequences, reconstruction to inherently predictable sequences, and recollection to sequences that are neither. The taxonomy is used to construct a predictive model for memorization, which outperforms simpler models. The study also examines the scaling factors in memorization, finding that larger models and longer training times increase the number of memorized sequences, with recollection showing the fastest growth. The paper discusses the implications of these findings for various motivations, such as copyright, privacy, and scientific understanding of generalization. The authors conclude by highlighting the potential of their taxonomy for interpreting complex phenomena in deep learning and other fields.The paper "Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon" by USVSN Sai Prashanth et al. explores the multifaceted nature of memorization in language models (LMs). The authors propose a taxonomy to categorize memorized data into three types: recitation, reconstruction, and recollection. Recitation refers to highly duplicated sequences, reconstruction to inherently predictable sequences, and recollection to sequences that are neither. The taxonomy is used to construct a predictive model for memorization, which outperforms simpler models. The study also examines the scaling factors in memorization, finding that larger models and longer training times increase the number of memorized sequences, with recollection showing the fastest growth. The paper discusses the implications of these findings for various motivations, such as copyright, privacy, and scientific understanding of generalization. The authors conclude by highlighting the potential of their taxonomy for interpreting complex phenomena in deep learning and other fields.