Language models scale reliably with over-training and on downstream tasks

14 Jun 2024 | Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
The paper addresses the limitations of current scaling studies in language model training, particularly the gap between compute-optimal training and the over-training that is common in practice. It introduces a testbed of 104 models with 0.011B to 6.9B parameters, trained on three data distributions. The authors fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters, predicting validation loss and downstream task performance. Key findings include:

1. **Reliable scaling in the over-trained regime**: The reducible loss of models trained with a fixed amount of over-training follows a power law in training compute; the scaling exponent stays constant across over-training amounts while the scalar changes. This allows validation loss to be extrapolated reliably to models trained with 300× more compute.
2. **Downstream task performance prediction**: A power-law relationship is established between language modeling perplexity and average top-1 error on downstream tasks, and it is used to predict average top-1 error for models trained with 20× more compute (illustrated in the sketch below).
3. **Experimental setup and results**: The authors train models on several datasets and evaluate them with multiple metrics. Predictions are accurate across many settings, with relative errors within 0.7% for over-trained models and 0.05% for compute-optimal models.
4. **Limitations and future work**: The paper identifies limitations, such as the need for extensive hyperparameter sweeps, and suggests future directions for improving the scalability of scaling laws.

The paper aims to provide a more reliable framework for de-risking expensive training runs and predicting downstream performance, making it accessible to both researchers and practitioners.
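To make the two fits concrete, here is a minimal sketch (not the authors' code) of how one might chain them: first fit a loss-vs-compute power law L(C) = E + A·(C/C₀)^(−α) on small-scale runs, then fit downstream error as a power law in perplexity, which is equivalent to an exponential decay in loss, Err(L) = ε − k·exp(−γ·L), and use the chained fits to predict error at larger compute. The functional forms follow the relationships described above; the synthetic data points, the normalization constant C₀, and all fitted values are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): chain a compute -> loss power-law fit
# with a loss -> downstream-error fit, as described in the summary above.
# The data points and starting values below are synthetic, illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

C0 = 1e19  # arbitrary reference compute (FLOPs) used to normalize C for a stable fit

# Hypothetical small-scale measurements: training compute C, validation loss L
# (nats/token), and average downstream top-1 error.
C = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
L = np.array([3.60, 3.35, 3.12, 2.95, 2.80])
err = np.array([0.62, 0.58, 0.54, 0.50, 0.47])

# 1) Reducible loss follows a power law in compute: L(C) = E + A * (C / C0)**(-alpha).
def loss_law(C, E, A, alpha):
    return E + A * (C / C0) ** (-alpha)

(E, A, alpha), _ = curve_fit(loss_law, C, L, p0=[2.0, 1.0, 0.2], maxfev=20000)

# 2) Downstream error is a power law in perplexity, i.e. an exponential decay
#    in loss: Err(L) = eps - k * exp(-gamma * L).
def err_law(L, eps, k, gamma):
    return eps - k * np.exp(-gamma * L)

(eps, k, gamma), _ = curve_fit(err_law, L, err, p0=[0.9, 5.0, 1.0], maxfev=20000)

# Extrapolate: predict loss and downstream error at 20x and 300x the largest
# fitted compute budget, mirroring the extrapolation ranges described above.
for mult in (20, 300):
    C_big = C[-1] * mult
    L_pred = loss_law(C_big, E, A, alpha)
    err_pred = err_law(L_pred, eps, k, gamma)
    print(f"{mult:>3}x compute: predicted loss {L_pred:.3f}, "
          f"predicted avg top-1 error {err_pred:.3f}")
```

Fitting against compute normalized by C₀ keeps the curve-fit numerically well conditioned; the chained prediction (compute → loss → downstream error) is the same two-step structure the paper's findings describe, just with made-up numbers.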