17 Jun 2024 | Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
This paper investigates how large language models (LLMs) acquire factual knowledge during pretraining. The study reveals several key insights into the dynamics of factual knowledge acquisition:
1. **Counterintuitive Data Impact**: Pretraining on more data does not significantly improve the model's ability to acquire and maintain factual knowledge.
2. **Power-Law Relationship**: Forgetting of both memorized and generalized factual knowledge follows a power law in the number of training steps, and LLMs trained on duplicated data forget faster (see the fitting sketch after this list).
3. **Batch Size Effect**: Training LLMs with larger batch sizes enhances their robustness to forgetting.
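The power-law forgetting claim can be pictured by fitting a decay curve to how much of a fact's probability gain survives as training continues after its last exposure. The sketch below is purely illustrative: the measurements, the `a * steps**(-b) + c` functional form, and all constants are assumptions, not values from the paper.

```python
# Illustrative sketch: fit a power-law forgetting curve to (hypothetical)
# measurements of how much probability gain a fact retains after its last
# occurrence in the training stream. Numbers are made up, not from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law_decay(steps, a, b, c):
    """Retained gain after `steps` training steps: a * steps**(-b) + c."""
    return a * np.power(steps, -b) + c

# Hypothetical data: steps elapsed since last exposure vs. retained gain.
steps = np.array([10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0])
retained_gain = np.array([0.90, 0.62, 0.51, 0.30, 0.24, 0.13])

params, _ = curve_fit(power_law_decay, steps, retained_gain, p0=[1.0, 0.3, 0.0])
a, b, c = params
print(f"fitted decay exponent b ≈ {b:.2f}")  # a larger b means faster forgetting
```

On real measurements, comparing the fitted exponent between deduplicated and duplicated training runs would make the "duplicated data forgets faster" finding concrete.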
The findings suggest that factual knowledge acquisition during pretraining proceeds by accumulating small increases in the probability assigned to a fact each time it is presented in the training data; however, these gains are subsequently diluted by forgetting. The study also offers plausible explanations for observed behaviors, such as LLMs' poor performance on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
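One way to picture this accumulate-then-forget dynamic is a toy simulation, sketched below. It is not the paper's model: the fixed per-exposure gain, the power-law decay between exposures, and the exposure intervals are all made-up assumptions chosen for illustration.

```python
# Toy accumulate-and-forget model (illustrative assumptions, not the paper's).
# Each exposure to a fact adds a fixed probability gain; between exposures the
# accumulated gain decays as a power law of the steps since the last exposure.
import numpy as np

def simulate(total_steps, exposure_every, gain_per_exposure=0.1, decay_exp=0.2):
    prob_gain = 0.0
    since_exposure = 0
    history = []
    for step in range(1, total_steps + 1):
        since_exposure += 1
        if step % exposure_every == 0:
            prob_gain += gain_per_exposure  # immediate boost when the fact appears
            since_exposure = 0
        else:
            # multiplicative update equivalent to gain ∝ (1 + t)**(-decay_exp)
            prob_gain *= (since_exposure / (since_exposure + 1)) ** decay_exp
        history.append(prob_gain)
    return np.array(history)

frequent = simulate(10_000, exposure_every=200)     # fact seen often
long_tail = simulate(10_000, exposure_every=4_000)  # long-tail fact
print(f"final gain: frequent ≈ {frequent[-1]:.3f}, long-tail ≈ {long_tail[-1]:.3f}")
```

With frequent exposures the accumulated gain settles at a noticeably higher level, while for the long-tail fact most of each increment is forgotten before the next exposure, which mirrors the explanation above for poor long-tail performance.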
The research contributes to a better understanding of the dynamics of factual knowledge acquisition in LLMs, which can help in improving their performance and making more effective use of LLMs.