23 Jan 2020 | Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
This article investigates empirical scaling laws for language model performance on the cross-entropy loss, focusing on the Transformer architecture. Performance depends most strongly on scale, meaning the number of model parameters $N$, the size of the training dataset $D$, and the amount of compute $C$ used for training; other architectural details such as network width or depth have minimal effect within a wide range. The loss follows a power-law in each of these three factors when not bottlenecked by the other two, with trends spanning more than seven orders of magnitude.

The performance penalty from overfitting depends predictably on the ratio $N^{0.74}/D$: increasing the model size 8x requires increasing the data by only roughly 5x to avoid a penalty. Training curves follow predictable power-laws whose parameters are roughly independent of model size. Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and fewer data points.

Training to convergence is inefficient: for a fixed compute budget, optimal performance is achieved by training very large models on a relatively modest amount of data and stopping significantly before convergence. The compute-optimal allocation therefore puts most of a budget increase into model size, with only a comparatively small increase in data, mainly enough to avoid reusing it. These results suggest that larger language models will continue to perform better and be more sample-efficient than current models.
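For reference, the relationships described above can be sketched as simple power-law forms for the test loss $L$; the constants $N_c$, $D_c$, $C_c$ and exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ stand for the empirical fits reported in the paper, and the expressions here are a schematic sketch rather than a full statement of those fits:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C},$$

each holding when the loss is not bottlenecked by the other two factors. The combined dependence on model and dataset size takes the form

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D},$$

which is where the $N^{0.74}/D$ ratio comes from: keeping the bracketed penalty term fixed as $N$ grows requires $D \propto N^{\alpha_N/\alpha_D} \approx N^{0.74}$.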
The study provides a framework for predicting the performance of language models based on the scaling laws derived from the empirical results.
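As a minimal numerical sketch of that framework, the snippet below works through the 8x-model / ~5x-data rule and extrapolates loss with the $L(N, D)$ form above. The function names (`data_needed_ratio`, `loss`) and the numeric constants (`ALPHA_N`, `ALPHA_D`, `N_C`, `D_C`) are illustrative assumptions chosen for this example, not the paper's exact fitted values; only the $0.74$ exponent is taken from the text.

```python
# Illustrative sketch of the scaling relations described above.
# The constants are placeholders, not the paper's exact fitted values;
# only the 0.74 exponent comes from the summary text.

ALPHA_N = 0.076   # assumed exponent for model size
ALPHA_D = 0.103   # assumed exponent for dataset size (0.076 / 0.103 ~ 0.74)
N_C = 8.8e13      # assumed critical parameter count
D_C = 5.4e13      # assumed critical token count


def data_needed_ratio(model_scale: float, exponent: float = 0.74) -> float:
    """Factor by which the dataset must grow when the model grows by
    `model_scale`, keeping the N**0.74 / D penalty ratio fixed."""
    return model_scale ** exponent


def loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) power-law form sketched above, with placeholder constants."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


if __name__ == "__main__":
    # Increasing model size 8x requires only ~5x more data to avoid a penalty:
    print(f"8x model -> {data_needed_ratio(8):.1f}x data")  # ~4.7x, i.e. roughly 5x

    # A larger model with proportionally scaled data gives a lower predicted loss:
    print(f"L(1e8 params, 1e10 tokens)   = {loss(1e8, 1e10):.3f}")
    print(f"L(1e9 params, ~5.5e10 tokens) = {loss(1e9, 10**0.74 * 1e10):.3f}")
```

Usage note: because `data_needed_ratio(8)` is about 4.7, the sketch reproduces the "8x model, roughly 5x data" statement from the summary; the `loss` values themselves depend entirely on the placeholder constants and should only be read as showing the direction of the trend.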