Language models scale reliably with over-training and on downstream tasks

14 Jun 2024 | Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
This paper investigates the reliability of scaling laws for large language models (LLMs) in both over-trained and downstream task settings. The authors build a testbed of 104 models ranging from 0.011B to 6.9B parameters, trained on three different data distributions. They find that scaling laws accurately predict the performance of larger, more heavily over-trained models as well as downstream task performance. Specifically, the validation loss of a 1.4B parameter model trained on 900B tokens is predicted with high accuracy by a scaling law that extrapolates in both the amount of over-training and the number of model parameters. Similarly, the average top-1 error on downstream tasks for a 6.9B parameter model is predicted with high accuracy via a power-law relationship between language modeling perplexity and downstream task performance. The authors also find that scaling laws fit to downstream error benefit from using more expensive models than those fit for loss prediction. These results suggest the proposed scaling laws are a promising tool for derisking the effects of over-training and the downstream performance of scaled-up training recipes. The experiments are available at https://github.com/mlfoundations/scaling.
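As a rough illustration of the two-step prediction described above, the sketch below fits a Chinchilla-style loss law L(N, D) = E + A·N^(−α) + B·D^(−β) on cheap runs, extrapolates it to a larger, more over-trained model, and then maps the predicted loss to downstream error via Err(L) = ε − k·e^(−γL) (equivalently a power law in perplexity, since e^(−γL) = PPL^(−γ)). The functional forms, synthetic data, and coefficients here are illustrative assumptions, not the paper's exact parameterization or code.

```python
# Minimal sketch (not the authors' code): fit a loss scaling law on small runs,
# extrapolate to a larger over-trained model, then map loss to downstream error.
# All data and coefficients below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(ND, E, A, alpha, B, beta):
    # Assumed Chinchilla-style form: L(N, D) = E + A*N^-alpha + B*D^-beta
    N, D = ND  # model parameters, training tokens
    return E + A * N**-alpha + B * D**-beta

def err_from_loss(L, eps, k, gamma):
    # Assumed loss-to-error map: Err(L) = eps - k * exp(-gamma * L)
    return eps - k * np.exp(-gamma * L)

# Synthetic "small model" grid standing in for the cheap training runs.
N = np.array([1.1e7, 7.9e7, 4.1e8, 1.4e9])    # model parameters
D = np.array([2.2e8, 1.6e9, 8.2e9, 2.8e10])   # training tokens
NN, DD = np.meshgrid(N, D)
x = np.vstack([NN.ravel(), DD.ravel()])
true_params = (1.7, 400.0, 0.34, 1200.0, 0.28)
y = loss_law(x, *true_params) + np.random.default_rng(0).normal(0, 0.01, x.shape[1])

popt, _ = curve_fit(loss_law, x, y, p0=(2.0, 100.0, 0.3, 1000.0, 0.3), maxfev=20000)

# Extrapolate to a larger, over-trained configuration (e.g. 6.9B params, 900B tokens)
# and convert the predicted loss to downstream top-1 error with assumed coefficients.
L_pred = loss_law((6.9e9, 9.0e11), *popt)
err_pred = err_from_loss(L_pred, eps=0.9, k=0.8, gamma=0.5)
print(f"predicted loss {L_pred:.3f}, predicted avg top-1 error {err_pred:.3f}")
```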