28 Mar 2024 | Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui
This paper proposes a method for checkpoint merging in large language model (LLM) pretraining, using Bayesian optimization to determine the optimal merging weight. The method leverages checkpoints that are already saved during pretraining, combining them to improve pretraining efficiency and reduce computational costs. Through various experiments, the authors demonstrate that their approach can significantly enhance pretraining performance, offering nearly a free lunch in terms of resource savings. The merged models are robust across different domains and maintain strong generalization capabilities, even though the merging weight is searched on a specific held-out dataset.
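The merging operation itself is plain parameter-wise interpolation between two saved checkpoints. A minimal sketch, assuming checkpoints are stored as name-to-array dictionaries (the `merge_checkpoints` name and dict layout are illustrative, not from the paper):

```python
import numpy as np

def merge_checkpoints(ckpt_a, ckpt_b, alpha):
    """Parameter-wise interpolation of two checkpoints:
    theta_merged = alpha * theta_a + (1 - alpha) * theta_b."""
    assert ckpt_a.keys() == ckpt_b.keys(), "checkpoints must share a parameter layout"
    return {name: alpha * ckpt_a[name] + (1 - alpha) * ckpt_b[name]
            for name in ckpt_a}

# Toy usage with two tiny stand-in "checkpoints"
a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
merged = merge_checkpoints(a, b, alpha=0.5)
# alpha=0.5 recovers the uniform average of the two checkpoints
```

The whole question the paper studies is how to pick `alpha`: a uniform average is the simplest baseline, while the proposed method searches for the weight that minimizes loss on a held-out set.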
The key contribution of this work is the application of Bayesian optimization to find the optimal merging weight, which is particularly effective for expensive, black-box, and derivative-free objective functions. The method involves a series of pilot experiments to explore the characteristics of checkpoint merging, including which checkpoints to merge, how many checkpoints to merge, and how to merge them. The results show that merging adjacent checkpoints can yield better performance than individual checkpoints, and that the optimal merging weight can be determined through Bayesian optimization.
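Bayesian optimization fits this setting because each candidate weight requires an expensive held-out evaluation and no gradients are available. A toy, numpy-only 1-D sketch with a Gaussian-process surrogate and an expected-improvement acquisition maximized over a grid; the quadratic `loss` merely stands in for the real held-out evaluation, and all names and hyperparameters here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from math import erf

def rbf(x1, x2, ls=0.2):
    """RBF kernel between two 1-D point sets."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and std at query points Xs given observations (X, y)."""
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected amount by which a point beats `best`."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(objective, n_init=3, n_iter=10, seed=0):
    """Search the merging weight alpha in [0, 1] minimizing `objective`."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)
    y = np.array([objective(a) for a in X])
    grid = np.linspace(0.0, 1.0, 201)
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        a_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, a_next)
        y = np.append(y, objective(a_next))
    return X[np.argmin(y)], y.min()

# Hypothetical stand-in for the expensive held-out loss, minimized at alpha = 0.7
loss = lambda a: (a - 0.7) ** 2
best_alpha, best_loss = bayes_opt(loss)
```

Narrowing the search space, as discussed below, simply amounts to shrinking the `[0, 1]` interval that the initial samples and the acquisition grid cover.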
The authors also investigate the impact of varying the size of the held-out dataset and the merging weight search space on the effectiveness of their method. They find that the size of the held-out dataset has minimal impact on the performance of their method, and that a narrower search space is beneficial when the performance gap between checkpoints is significant. The method is tested on various benchmark datasets, including C-Eval, CMMLU, MMLU, and GSM8K, and shows superior performance compared to existing baselines such as uniform soup, greedy soup, Fisher weighted averaging, and RegMean.
The results demonstrate that the proposed method not only improves pretraining performance but also maintains strong generalization across different domains. The method is applicable to various LLMs, including Baichuan2 and DeepSeek, and shows promising results in terms of efficiency and effectiveness. The authors conclude that their method provides a resource-efficient way to enhance LLM pretraining while maintaining the generalization capabilities of the checkpoints.