Investigating Continual Pretraining in Large Language Models: Insights and Implications

12 Feb 2025 | Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis
This paper investigates continual pretraining in large language models (LLMs), focusing on continual domain-adaptive pretraining, which enables LLMs to integrate new information from a stream of domains while retaining previously learned knowledge. The study introduces a new benchmark to measure the adaptability of LLMs to changing pretraining data landscapes. It also examines the impact of model size on learning efficacy and forgetting, as well as how the progression and semantic similarity of emerging domains affect knowledge transfer within these models.

Key findings include:
(i) continual pretraining consistently improves models smaller than 1.5B parameters and outperforms standalone domain-adaptive pretraining;
(ii) larger models achieve better perplexity than smaller ones when continually pretrained on the same corpus;
(iii) smaller models are particularly sensitive to continual pretraining, showing the highest rates of both learning and forgetting;
(iv) continual pretraining boosts the downstream task performance of the GPT-2 family;
(v) continual pretraining lets LLMs specialize better when the sequence of training domains is semantically similar, whereas randomizing the domain order yields better transfer and final performance otherwise.

The study evaluates the effectiveness of continual learning by measuring perplexity on the benchmark's fine-grained (L2) domain test sets and by analyzing forward and backward knowledge transfer.
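To make this evaluation protocol concrete, here is a minimal sketch of per-domain perplexity and the derived transfer scores, assuming Hugging Face `transformers`, a list of per-stage checkpoints, and per-domain test texts. All names below are hypothetical, and the transfer formulas follow the standard backward/forward-transfer definitions from the continual-learning literature, adapted to perplexity; the paper's exact formulas may differ.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cuda", max_len=1024):
    """Token-weighted perplexity of a causal LM over one domain's test texts."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_len).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()  # approximate token count; HF shifts labels internally
        total_nll += out.loss.item() * n  # out.loss is the mean per-token NLL
        total_tokens += n
    return math.exp(total_nll / total_tokens)

def evaluate_checkpoints(checkpoints, domain_tests, base_ckpt, device="cuda"):
    """Build ppl[t][d]: perplexity on domain d after training stage t,
    where checkpoint t is assumed to have just finished training on domain t.
    Also returns the base (pre-continual-training) model's row."""
    tok = AutoTokenizer.from_pretrained(base_ckpt)
    def row(ckpt):
        model = AutoModelForCausalLM.from_pretrained(ckpt).to(device)
        return [perplexity(model, tok, texts, device) for texts in domain_tests]
    return [row(c) for c in checkpoints], row(base_ckpt)

def transfer_metrics(ppl, base_row):
    """Backward/forward transfer adapted to perplexity (lower is better),
    with signs chosen so that positive values mean helpful transfer."""
    T = len(ppl)
    backward = [ppl[d][d] - ppl[T - 1][d] for d in range(T - 1)]  # old domains at the end
    forward = [base_row[d] - ppl[d - 1][d] for d in range(1, T)]  # unseen domains, pre-exposure
    return sum(backward) / len(backward), sum(forward) / len(forward)
```

For T domains this runs (T+1)×T evaluations: the diagonal ppl[d][d] tracks how well each domain was learned at the time it was trained, while the final row shows what the last checkpoint retains.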
It also investigates how the order of training domains influences performance, finding that randomizing the order enables positive transfer and reduces forgetting; a toy sketch of ordered versus randomized curricula follows the concluding paragraph below. The results underline the importance of model size: larger models generally reach lower perplexity, yet they are less sensitive to continual pretraining than smaller ones. The study also shows that continual pretraining improves downstream task performance, particularly for GPT-2 models, and that the order of training domains significantly affects a model's ability to retain and transfer knowledge.

The paper concludes that continual pretraining is a promising approach for LLMs, enabling them to adapt to new domains without exhaustive retraining from scratch. It establishes a new benchmark for continual learning in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains. The findings suggest that continual pretraining is a valuable strategy for improving LLM performance, particularly in dynamic environments where new information constantly emerges.
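As a toy illustration of the two ordering strategies compared above, the following sketch builds a semantically ordered curriculum and a randomized one. Greedy nearest-neighbor chaining over mean domain embeddings is just one plausible way to operationalize "semantic similarity"; the paper's exact ordering procedure is not reproduced here. The sketch assumes the `sentence-transformers` package, and all names are hypothetical.

```python
import random
import numpy as np
from sentence_transformers import SentenceTransformer

def domain_embeddings(domain_texts, model_name="all-MiniLM-L6-v2"):
    """One mean embedding vector per domain, from a sample of its documents."""
    encoder = SentenceTransformer(model_name)
    return {name: encoder.encode(texts).mean(axis=0)
            for name, texts in domain_texts.items()}

def similarity_order(embs, start):
    """Greedy nearest-neighbor chain: always continue with the most
    cosine-similar unvisited domain, giving a semantically smooth curriculum."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    order, remaining = [start], set(embs) - {start}
    while remaining:
        nxt = max(remaining, key=lambda d: cos(embs[order[-1]], embs[d]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def random_order(embs, seed=0):
    """Randomized curriculum: the condition the paper finds transfers better
    when consecutive domains are not semantically related."""
    names = sorted(embs)
    random.Random(seed).shuffle(names)
    return names
```

Under the paper's findings, `similarity_order` corresponds to the setting where models specialize best, while `random_order` corresponds to the setting with better transfer and less forgetting.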