2024 | Haowei Lin*, Baizhou Huang*, Haotian Ye*, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
This paper addresses the challenge of selecting the most appropriate pre-trained model for fine-tuning in the context of large language models (LLMs). Given the vast number of available models and limited resources, the authors propose a framework that predicts fine-tuning performance and connects it to scaling laws. Unlike the pre-training setting, the fine-tuning scaling curve includes not only the familiar "power phase" but also a previously unobserved "pre-power phase." The authors introduce the concept of "pre-learned data size" to explain this phase transition and develop a Rectified Scaling Law that fits experimental results more closely. Based on this law, they propose an efficient LLM selection algorithm called "Accept then Stop" (ATS), which selects near-optimal models with significantly reduced resource consumption. ATS outperforms baseline methods on both Pearson correlation coefficient and relative accuracy, demonstrating its robustness and efficiency. The paper also discusses limitations and future directions, emphasizing the importance of collaborative and decentralized research on scaling laws.
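To make the idea of a rectified fine-tuning curve and extrapolation-based model selection concrete, here is a minimal, hypothetical sketch. It assumes a rectified functional form L(D) = B / (D_l + D)^β + E, where D is the fine-tuning data size and D_l plays the role of the pre-learned data size; the parameter names, the fitting routine, and the toy numbers below are illustrative assumptions, not taken from the paper, and the actual ATS stopping rule is not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_law(D, log_B, log_Dl, beta, E):
    """Assumed rectified form L(D) = B / (D_l + D)**beta + E.

    B and D_l are fitted in log-space so they stay positive.
    """
    B, D_l = np.exp(log_B), np.exp(log_Dl)
    return B / (D_l + D) ** beta + E

def predict_full_data_loss(sizes, losses, full_size):
    """Fit the assumed law on small-scale runs and extrapolate to full_size."""
    # Rough, heuristic initial guesses for the optimizer.
    p0 = [np.log(losses[0] * np.sqrt(sizes[0])), np.log(sizes[0]), 0.5, 0.0]
    params, _ = curve_fit(rectified_law, sizes, losses, p0=p0, maxfev=20000)
    return rectified_law(full_size, *params)

if __name__ == "__main__":
    # Toy observations: model name -> (fine-tuning subset sizes, observed losses).
    observations = {
        "model_a": ([100, 200, 400, 800, 1600], [3.2, 2.9, 2.6, 2.2, 1.9]),
        "model_b": ([100, 200, 400, 800, 1600], [3.5, 3.1, 2.5, 2.0, 1.6]),
    }
    full_size = 100_000
    predicted = {
        name: predict_full_data_loss(np.array(s, float), np.array(l, float), full_size)
        for name, (s, l) in observations.items()
    }
    best = min(predicted, key=predicted.get)
    print(predicted, "-> select", best)
```

In this simplified stand-in, whichever model has the lowest extrapolated loss at the target data size is selected; the paper's ATS procedure is more involved, deciding which low-resource observations to trust before extrapolating.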