Simple and Scalable Strategies to Continually Pre-train Large Language Models

26 Mar 2024 | Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
The paper "Simple and Scalable Strategies to Continually Pre-train Large Language Models" explores methods to efficiently update large language models (LLMs) on new data without retraining from scratch. The authors address the high costs and computational demands of retraining LLMs, particularly when new, high-quality datasets become available. They propose "continual pre-training": further pre-training LLMs on large amounts of new data while maintaining performance on existing datasets. This setting is distinct from most of the continual learning literature due to the scale of the incoming data.

The key contributions of the paper include:

1. **Learning rate re-warming and re-decaying**: The authors show that re-warming and re-decaying the learning rate is necessary for effective adaptation during continual pre-training (see the schedule sketch below).
2. **Replay of previous data**: They demonstrate that a small percentage of replay (e.g., 5%) can significantly mitigate forgetting when updating models on hundreds of billions of new tokens (see the data-mixing sketch below).
3. **Combination of techniques**: A combination of LR re-warming, LR re-decaying, and compute-equivalent replay allows continually pre-trained models to achieve performance similar to models re-trained on all data, while using significantly less compute.
4. **Infinite learning rate schedules**: The authors propose infinite learning rate schedules, which allow smooth transitions across datasets and help prevent optimization-related forgetting (see the infinite-schedule sketch below).

The paper conducts a large-scale empirical study using models of two sizes (405M and 10B parameters) and two distribution shifts (weak and strong). The results show that the proposed techniques effectively mitigate forgetting and improve adaptation, making continual pre-training a viable and efficient approach for updating LLMs on new data. The authors also provide guidelines for applying these techniques, including recommendations for learning rate schedules and replay percentages.
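To make the first contribution concrete, below is a minimal sketch of a re-warmed and re-decayed learning rate schedule: when continual pre-training begins on the new dataset, the step counter is reset so the learning rate linearly re-warms to a maximum value and then follows a cosine re-decay. The function name, learning-rate values, and warmup length are illustrative assumptions, not values taken from the paper.

```python
import math

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Linear re-warmup to max_lr, then cosine re-decay down to min_lr.

    Assumption: `step` is reset to 0 when continual pre-training starts on the
    new dataset, so the schedule re-warms and re-decays from scratch.
    """
    if step < warmup_steps:
        # Re-warming phase: ramp the LR back up from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Re-decaying phase: cosine decay from max_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```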
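The replay result can likewise be sketched as a data-mixing loop that draws a small fraction of training examples from the previous pre-training distribution. The 5% default mirrors the replay percentage discussed in the paper; the iterator-based interface and function name are assumptions made for illustration.

```python
import random

def mixed_example_stream(new_data_iter, old_data_iter, replay_fraction=0.05):
    """Yield mostly new-dataset examples, replaying a small fraction of old data.

    Assumes both iterators are effectively infinite (e.g., cycled dataset shards).
    With replay_fraction=0.05, roughly 5% of training examples are replayed.
    """
    while True:
        if random.random() < replay_fraction:
            yield next(old_data_iter)   # replayed example from the previous dataset
        else:
            yield next(new_data_iter)   # example from the new dataset
```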
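Finally, the infinite learning rate schedules can be pictured as a piecewise function with a warmup, a cooldown to a constant plateau that can be held across dataset transitions, and a short annealing phase before a checkpoint is released. The phase boundaries, learning-rate values, and the linear cooldown shape below are illustrative assumptions rather than the paper's exact formulation.

```python
def infinite_schedule_lr(step, warmup_steps, cooldown_steps, anneal_start, anneal_steps,
                         max_lr=3e-4, const_lr=1.5e-4, min_lr=3e-5):
    """Sketch of an infinite schedule: warmup -> cooldown -> constant plateau -> annealing.

    The constant plateau can be held for as long as new data keeps arriving;
    annealing to min_lr is applied only before releasing a checkpoint.
    All boundaries and LR values here are illustrative assumptions.
    """
    if step < warmup_steps:
        # Warmup: linear ramp from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Cooldown: move from max_lr toward the constant plateau value.
        t = (step - warmup_steps) / cooldown_steps
        return max_lr + t * (const_lr - max_lr)
    if step < anneal_start:
        # Constant phase: hold const_lr across dataset transitions.
        return const_lr
    # Annealing: short final decay from const_lr to min_lr before a checkpoint.
    t = min(1.0, (step - anneal_start) / anneal_steps)
    return const_lr + t * (min_lr - const_lr)
```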