GENERALIZATION VS. MEMORIZATION: TRACING LANGUAGE MODELS' CAPABILITIES BACK TO PRETRAINING DATA


2 Mar 2025 | Xinyi Wang1*, Antonis Antoniades1*, Yanai Elazar2,3, Alfonso Amayuelas1, Alon Albalak4, Kexun Zhang5, William Yang Wang1
The paper addresses the debate over whether large language models (LLMs) primarily rely on memorization or genuinely generalize to unseen tasks. The authors introduce distributional memorization, defined as the correlation between an LLM's output probabilities and the frequency of related content in its pretraining data. To estimate task-specific pretraining frequency, they propose a task-gram language model built by counting semantically related n-gram pairs, drawn from task inputs and outputs, in the pretraining corpus. Using Pythia models trained on the Pile, they evaluate four tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. The results reveal varying degrees of memorization, with factual question answering showing the strongest effect. While performance improves on all tasks as model size grows, only factual question answering exhibits increased memorization; machine translation and reasoning tasks instead show greater generalization. The study concludes that memorization dominates in simpler, knowledge-intensive tasks, whereas generalization is crucial for harder, reasoning-based tasks, and it provides a scalable method for analyzing large pretraining corpora.
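As a rough illustration of how distributional memorization could be computed, here is a minimal Python sketch. It assumes Spearman rank correlation as the correlation measure and uses toy numbers; the function names (`count_task_gram_pairs`, `distributional_memorization`) and the substring-matching shortcut are hypothetical simplifications, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): distributional memorization as the
# rank correlation between task-gram counts in the pretraining corpus and the
# LLM's output probabilities on the corresponding test examples.

from collections import Counter
from scipy.stats import spearmanr


def count_task_gram_pairs(input_ngrams, output_ngrams, corpus_documents):
    """Count co-occurrences of (input, output) n-gram pairs in the pretraining
    corpus -- a simplified stand-in for the paper's task-gram language model."""
    counts = Counter()
    for doc in corpus_documents:
        for x in input_ngrams:
            if x in doc:
                for y in output_ngrams:
                    if y in doc:
                        counts[(x, y)] += 1
    return counts


def distributional_memorization(task_gram_counts, llm_log_probs):
    """Correlate per-example task-gram frequency with the LLM's log-probability
    of the gold output; a higher rank correlation indicates stronger
    distributional memorization, a lower one suggests generalization."""
    rho, p_value = spearmanr(task_gram_counts, llm_log_probs)
    return rho, p_value


# Hypothetical usage with toy numbers: per test example, the summed task-gram
# pair count versus the model's log-probability of the reference answer.
example_counts = [1200, 87, 45000, 3, 980]            # task-gram frequencies
example_log_probs = [-4.1, -9.3, -1.2, -12.7, -5.0]   # LLM log p(output | input)
rho, p = distributional_memorization(example_counts, example_log_probs)
print(f"distributional memorization (Spearman rho) = {rho:.2f}, p = {p:.3f}")
```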
[slides and audio] Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data