17 Jun 2024 | Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
This paper investigates how large language models (LLMs) acquire factual knowledge during pretraining. The authors find that factual knowledge is acquired through the gradual accumulation of small increases in the probability assigned to the target knowledge each time the model encounters it, and that these gains are subsequently diluted by forgetting. Key findings: (1) pretraining on more data does not substantially improve the model's ability to acquire and retain factual knowledge; (2) forgetting of acquired factual knowledge follows a power law in training steps, and duplicated training data leads to faster forgetting; (3) larger batch sizes make models more robust to forgetting; (4) larger models acquire factual knowledge more effectively; (5) deduplicating the training data and increasing the batch size both improve retention. These dynamics help explain observed LLM behaviors such as poor performance on long-tail knowledge and the benefits of deduplication. Overall, the results portray factual knowledge acquisition as a dynamic process shaped by training conditions, in which acquired knowledge decays over time unless reinforced, underscoring the importance of training data quality and batch size for model performance.
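As a rough illustration of the power-law forgetting claim, the sketch below fits a curve of the form R(t) = a · t^(−b) to hypothetical per-fact retention measurements (improvement in the target span's log-probability, recorded t steps after the fact's last occurrence). The checkpoint spacing and values are made up for illustration and are not the paper's data or method.

```python
import numpy as np

# Hypothetical measurements (illustrative only): retained improvement in the
# target-span log-probability of one fact, t training steps after its last
# occurrence in the training stream.
steps = np.array([10, 50, 100, 500, 1000, 5000])
retained = np.array([0.80, 0.55, 0.45, 0.28, 0.22, 0.12])

# Fit a power law R(t) = a * t^(-b) by least squares in log-log space:
# log R = log a - b * log t.
slope, log_a = np.polyfit(np.log(steps), np.log(retained), 1)
a, b = np.exp(log_a), -slope

print(f"fitted power law: R(t) ~= {a:.2f} * t^(-{b:.2f})")
```

Under such a fit, the exponent b summarizes how quickly acquired knowledge decays; the paper's observation that duplicated data accelerates forgetting would correspond to a steeper decay for facts seen in duplicated contexts.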