26 Mar 2024 | Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
This paper presents simple and scalable strategies for continual pre-training of large language models (LLMs), which allow models to be updated with new data without re-training from scratch. The authors demonstrate that a combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data can match the performance of full re-training on all available data, as measured by final loss and language model evaluation benchmarks. The approach is tested on two distribution shifts between common LLM pre-training datasets, a weak shift (English→English) and a strong shift (English→German), with models ranging from 405M to 10B parameters. The results show that continual pre-training strategies can achieve similar performance to full re-training while using significantly less compute.

The authors also propose alternative learning rate schedules that help mitigate forgetting caused by LR re-warming and are not bound to a fixed token budget. The study highlights the potential of continual learning for reducing the computational cost of updating LLMs and suggests that future work explore larger-scale experiments and more efficient replay strategies. The paper also discusses related work in continual learning, pre-training, and domain-adaptive continual pre-training, and provides a detailed methodology and experimental setup for evaluating the effectiveness of continual pre-training. The results show that replay of previous data is effective in mitigating forgetting, and that appropriate amounts of replay help maintain performance on previous data while adapting to new data.
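To make the recipe concrete, here is a minimal, hypothetical sketch of the LR re-warming and re-decaying described above: a linear re-warmup back to a peak LR followed by a cosine re-decay, restarted each time a new dataset is introduced. The function name and all hyperparameter values below are illustrative assumptions, not taken from the paper's code.

```python
import math

def continual_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_frac=0.01):
    """Piecewise LR schedule for one continual pre-training phase:
    a linear re-warming from min_lr to max_lr, then a cosine re-decay
    back down to min_lr over the remaining steps.

    All hyperparameter values (peak LR, 1% warmup, etc.) are placeholders.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Re-warming: ramp the LR back up when training resumes on new data.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine anneal from max_lr to min_lr for the rest of the phase.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# The same schedule would be restarted for every new dataset the model sees.
if __name__ == "__main__":
    for step in (0, 100, 5_000, 9_999):
        print(step, f"{continual_cosine_lr(step, total_steps=10_000):.2e}")
```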
The study concludes that simple and scalable continual learning strategies can be used to update LLMs efficiently, matching the performance of full re-training with significantly less compute.
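As a companion illustration of the replay ingredient, below is a hypothetical sketch of mixing a small fraction of documents from the previous dataset into each training batch. The function, the toy document lists, and the default replay fraction are assumptions for illustration only; the appropriate amount of replay depends on how strong the distribution shift is.

```python
import random

def build_mixed_batch(new_docs, old_docs, batch_size=16, replay_fraction=0.05):
    """Assemble one batch containing a fixed fraction of documents replayed
    from the previous pre-training dataset, with the rest drawn from the
    new dataset. Inputs are stand-in lists of tokenized documents.
    """
    n_replay = max(1, round(replay_fraction * batch_size))
    batch = random.sample(old_docs, n_replay)                # replayed data
    batch += random.sample(new_docs, batch_size - n_replay)  # new data
    random.shuffle(batch)
    return batch


# Example usage with toy "documents".
if __name__ == "__main__":
    old = [f"old_{i}" for i in range(1000)]
    new = [f"new_{i}" for i in range(1000)]
    print(build_mixed_batch(new, old))
```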