Efficient Continual Pre-training by Mitigating the Stability Gap


27 Jun 2024 | Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
The paper explores the efficiency of continual pre-training for Large Language Models (LLMs) in adapting to new domains, focusing on the "stability gap" phenomenon where LLMs initially perform poorly and then recover. The authors propose three strategies to mitigate this gap: (1) pre-training on a subset of the corpus for multiple epochs, (2) selecting high-quality tokens for pre-training, and (3) using a data mixture similar to the pre-training data. These strategies are evaluated on the OpenLlama-3B model and Llama-3-8B model, showing significant improvements in medical task performance and general task performance, respectively. The Llama-3-Physician model, trained with these strategies, outperforms other open-source models and GPT-4 on medical benchmarks. The paper also discusses the impact of learning rates and training subset sizes, providing insights into the effectiveness of the proposed strategies.
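
To make the three strategies concrete, below is a minimal Python sketch of how a continual pre-training data stream could be assembled along these lines. It is not the authors' code: the function names, the simple lexical quality heuristic, and all default parameters (subset size, number of epochs, quality quantile, mixing ratio) are illustrative assumptions.

```python
# Hypothetical sketch of the three strategies summarized above; names and
# heuristics are illustrative assumptions, not the paper's implementation.
import random


def quality_score(doc: str) -> float:
    # Placeholder quality proxy: favor less repetitive documents.
    tokens = doc.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def build_continual_pretraining_stream(
    domain_corpus: list[str],        # new-domain documents (e.g., medical text)
    general_corpus: list[str],       # data resembling the original pre-training mix
    subset_size: int = 1000,         # (1) train on a subset of the corpus...
    num_epochs: int = 4,             # ...for multiple epochs
    quality_quantile: float = 0.5,   # (2) keep only the higher-quality fraction
    general_mix_ratio: float = 0.2,  # (3) blend in general-domain data
    seed: int = 0,
) -> list[str]:
    rng = random.Random(seed)

    # (2) Rank domain documents by the quality proxy and keep the top fraction.
    ranked = sorted(domain_corpus, key=quality_score, reverse=True)
    keep = max(1, int(len(ranked) * quality_quantile))
    high_quality = ranked[:keep]

    # (1) Sample a fixed subset once and reuse it across epochs.
    subset = rng.sample(high_quality, min(subset_size, len(high_quality)))

    stream: list[str] = []
    for _ in range(num_epochs):
        epoch_docs = list(subset)
        # (3) Mix in general-domain documents at a fixed ratio each epoch.
        n_general = int(len(epoch_docs) * general_mix_ratio)
        epoch_docs += rng.sample(general_corpus, min(n_general, len(general_corpus)))
        rng.shuffle(epoch_docs)
        stream.extend(epoch_docs)
    return stream
```

The resulting document stream would then be tokenized and fed to the usual pre-training loop; the intent of the sketch is only to show how subset reuse, quality filtering, and data mixing fit together as a single pipeline.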