GENERALIZATION V.S. MEMORIZATION: TRACING LANGUAGE MODELS' CAPABILITIES BACK TO PRETRAINING DATA

2025 | Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
This paper investigates whether large language models (LLMs) generalize to new tasks or rely on memorized pretraining data. The authors introduce distributional memorization, a measure of the correlation between LLM output probabilities and pretraining data frequency, and propose a task-gram language model that captures task-specific pretraining frequency by counting semantically related n-gram pairs drawn from task inputs and outputs. Using Pythia models trained on the Pile, they evaluate four tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning.

Factual question answering exhibits the strongest memorization effect, while machine translation and reasoning tasks show greater generalization: memorization plays the larger role in simpler, knowledge-intensive tasks, whereas generalization drives harder, reasoning-based tasks. As model size increases, performance gains come from different sources: improved memorization for factual question answering, and increased generalization for the other tasks. Likewise, pretraining documents have the largest influence on factual question answering, followed by world knowledge questions, and the least influence on machine translation.

Building on this analysis of how pretraining data shapes LLM predictions, the paper proposes a prompt optimization method based on n-gram counts: prompts more similar to the pretraining data improve performance on knowledge-intensive tasks, while prompts less similar to it improve performance on reasoning-intensive tasks.
The study provides a comprehensive analysis of LLM capabilities and offers a scalable framework for investigating the fine-grained, task-relevant characteristics of pretraining corpora.
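The distributional-memorization idea described above, correlating a model's output probabilities with task-specific pretraining frequency, can be sketched roughly as follows. Everything here is an illustrative assumption, not the paper's implementation: the toy corpus, the function names, the use of plain bigram pairs (the paper mines semantically related n-gram pairs from the Pile), and the bare-bones Spearman correlation.

```python
# Hypothetical sketch: count (input n-gram, output n-gram) co-occurrences in a
# corpus as a stand-in for a task-gram language model, then rank-correlate
# those counts with the LLM's per-example output probabilities.
from itertools import product


def ngrams(text, n=2):
    """All word n-grams of a string (simple whitespace tokenization)."""
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]


def task_gram_count(task_input, task_output, corpus_docs, n=2):
    """Count documents containing both an input n-gram and an output n-gram.

    A crude proxy for task-specific pretraining frequency; the paper instead
    pairs semantically related n-grams and counts over the full corpus.
    """
    pairs = list(product(ngrams(task_input, n), ngrams(task_output, n)))
    count = 0
    for doc in corpus_docs:
        for in_gram, out_gram in pairs:
            if in_gram in doc and out_gram in doc:
                count += 1
    return count


def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie handling; illustrative only)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Under this sketch, a strong positive `spearman_rho` between task-gram counts and model probabilities across a task's examples would indicate distributional memorization, while a weak correlation would suggest generalization beyond pretraining frequency.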