D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

3 Jun 2024 | Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxian Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
The paper introduces the Domain-specific Continual Pre-Training Scaling Law (D-CPT Law) for optimizing the mixture ratio of general and domain-specific corpora in large language models (LLMs). The D-CPT Law aims to predict the optimal mixture ratio to enhance performance on specific downstream domains, reducing the need for costly grid-searching methods. The authors propose a parameterization that fits the D-CPT Law to data collected under various mixture ratios, model sizes, and dataset sizes. This parameterization is derived from the Chinchilla Scaling Law and includes a Domain-specific Learnable Coefficient (DLC) to handle cross-domain settings. The effectiveness and generalizability of the D-CPT Law are demonstrated through extensive experiments on six downstream domains, showing high accuracy and robustness. The paper also discusses three practical applications of the D-CPT Law: optimizing the trade-off between general and domain-specific abilities, determining the optimal mixture ratio with limited domain-specific data, and resource allocation. The Cross-Domain D-CPT Law is introduced to predict the performance of new domains using data from multiple domains, further reducing training costs. The authors conclude by highlighting the broader impacts of their work, including improved controllability and reduced environmental impact of LLMs.
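
To make the fitting procedure concrete, below is a minimal sketch of how a Chinchilla-style loss surface L(N, D) = E + A/N^alpha + B/D^beta, extended here with a mixture-ratio term, could be fitted to grid-search measurements and then queried for a planned training run. The r-dependent term, the synthetic data, and all constants are illustrative assumptions; they are not the paper's exact D-CPT parameterization or its Domain-specific Learnable Coefficient.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hedged sketch: a Chinchilla-style loss surface with an extra mixture-ratio
# term. The r-dependent term is an illustrative assumption, not the paper's
# exact D-CPT parameterization.
def loss_surface(X, E, A, alpha, B, beta, C, gamma):
    N, D, r = X  # model size (params), training tokens, domain mixture ratio
    return (E
            + A / np.power(N, alpha)              # Chinchilla model-size term
            + B / np.power(D, beta)               # Chinchilla data-size term
            + C / np.power(r * D + 1.0, gamma))   # assumed domain-data term

# Synthetic stand-in for real grid-search runs: (N, D, r) settings with
# measured domain validation losses (generated from the form above plus noise).
rng = np.random.default_rng(0)
N = rng.choice([5e8, 1e9, 3e9], size=60)
D = rng.uniform(1e9, 5e10, size=60)
r = rng.uniform(0.1, 1.0, size=60)
y = loss_surface((N, D, r), 1.69, 406.4, 0.34, 410.7, 0.28, 400.0, 0.30)
y = y + rng.normal(0.0, 0.01, size=60)

# Fit the parameterization to the measurements.
p0 = [1.5, 300.0, 0.3, 300.0, 0.3, 300.0, 0.3]
params, _ = curve_fit(loss_surface, (N, D, r), y, p0=p0, maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta", "C", "gamma"], params)))

# Query the fitted surface for a planned run (e.g., N = 3e9 params, D = 2e10
# tokens) across candidate mixture ratios. In practice one would also fit a
# surface on general-corpus validation loss and choose r subject to an
# acceptable general-ability degradation (the paper's trade-off application).
for ratio in (0.1, 0.25, 0.5, 0.75, 1.0):
    pred = loss_surface((3e9, 2e10, ratio), *params)
    print(f"r={ratio:.2f} -> predicted domain loss {pred:.3f}")
```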