27 Jun 2024 | Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
Efficient Continual Pre-training by Mitigating the Stability Gap
Continual pre-training has become a key approach for adapting Large Language Models (LLMs) to new domains. The process updates a pre-trained LLM on a corpus from the new domain, which shifts the training distribution. The study observes a temporary performance drop at the beginning of continual pre-training followed by a recovery phase, a phenomenon known as the "stability gap." Previously noted in vision models, it makes pre-training inefficient and causes forgetting of general task knowledge. To address this, the authors propose three strategies: (1) continually pre-training the LLM on a subset of the corpus for multiple epochs rather than on the full corpus once, (2) pre-training on the highest-quality sub-corpus to quickly boost domain performance, and (3) using a data mixture similar to the original pre-training data to reduce the distribution gap. These strategies are validated on Llama-family models, improving medical task performance as well as general task performance without forgetting. The resulting Llama-3-Physician model achieves strong performance on medical benchmarks, comparable to or better than GPT-4. The study also analyzes the stability gap in continual pre-training, explaining it through the interplay of plasticity and stability gradients. The proposed strategies effectively mitigate the stability gap, enhancing LLM performance while reducing computational cost. The models are released for further research.
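The three strategies amount to a data-selection recipe for the continual pre-training corpus. The Python sketch below illustrates one way they could be wired together; the quality scorer, subset size, mixing ratio, epoch count, and the train_step callback are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the three strategies, under assumed helpers:
# a user-supplied quality_scorer(example) -> float and a
# train_step(model, example) callback that performs one optimizer step.
import random


def select_high_quality_subset(domain_corpus, quality_scorer, subset_size):
    """Strategies 1 & 2: keep only a fixed-size, highest-quality slice of the domain corpus."""
    ranked = sorted(domain_corpus, key=quality_scorer, reverse=True)
    return ranked[:subset_size]


def mix_with_general_data(domain_subset, general_corpus, general_ratio=0.3):
    """Strategy 3: blend in pre-training-like general data to shrink the distribution gap.

    Assumes general_corpus is at least general_ratio * len(domain_subset) examples.
    """
    n_general = int(len(domain_subset) * general_ratio)
    mixture = list(domain_subset) + random.sample(general_corpus, n_general)
    random.shuffle(mixture)
    return mixture


def continual_pretrain(model, train_step, domain_corpus, general_corpus,
                       quality_scorer, subset_size=50_000, epochs=4):
    """Strategy 1: run multiple epochs over the fixed high-quality mixture."""
    subset = select_high_quality_subset(domain_corpus, quality_scorer, subset_size)
    mixture = mix_with_general_data(subset, general_corpus)
    for _ in range(epochs):
        for example in mixture:
            train_step(model, example)
    return model
```

The key design choice this sketch highlights is reusing a smaller, curated mixture for several epochs instead of sweeping the entire new-domain corpus once, which is how the paper frames mitigating the stability gap while cutting compute.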