May 16, 2024 | Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You
This paper aims to replicate the scaling law estimation methods proposed by Hoffmann et al. (2022) for training transformer language models under a given compute budget. Specifically, the authors focus on the third method, which involves fitting a parametric loss function to data reconstructed from Hoffmann et al.'s plots. They find that the estimates reported for this method are inconsistent with the first two methods, fail to fit the extracted data well, and come with implausibly narrow confidence intervals. The issues are attributed to two main factors: Hoffmann et al.'s optimizer stopping before convergence due to a poor choice of loss scale, and the rounding of parameter estimates reported in the paper, which introduces substantial bias into the predictions.
The authors reconstruct the dataset from Hoffmann et al. and fit a parametric function to model the final pre-training loss. Their results show that their estimated model differs significantly from the fit reported by Hoffmann et al., and that the reported fit fails to adequately describe the reconstructed data. They demonstrate that the confidence intervals reported by Hoffmann et al. are implausibly tight and unlikely to have been obtained from proper statistical procedures given the size of the dataset. Additionally, Hoffmann et al.'s reported fit is inconsistent both with the scaling policies derived from their other two approaches and with the scaling policy implied by the authors' own re-fit.
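The parametric fit at issue is Hoffmann et al.'s third approach: modeling loss as L(N, D) = E + A/N^α + B/D^β and minimizing a Huber loss on log-residuals with L-BFGS from a grid of initializations. The sketch below illustrates that recipe on synthetic data with known ground-truth parameters; the data, initialization grid, and Huber delta here are illustrative assumptions, not the values used in either paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Chinchilla-style parametric form: L(N, D) = E + A / N^alpha + B / D^beta.
# Following the published recipe, we parameterize a = log A, b = log B,
# e = log E and fit a Huber loss on log-residuals.

def predicted_log_loss(params, log_N, log_D):
    a, b, e, alpha, beta = params
    # log(exp(e) + exp(a - alpha*log N) + exp(b - beta*log D)), computed stably
    return logsumexp(
        np.stack([e * np.ones_like(log_N),
                  a - alpha * log_N,
                  b - beta * log_D]), axis=0)

def huber(r, delta):
    absr = np.abs(r)
    return np.where(absr <= delta, 0.5 * r**2, delta * (absr - 0.5 * delta))

def fit(log_N, log_D, log_L, delta=1e-3):
    best = None
    def objective(p):
        return huber(predicted_log_loss(p, log_N, log_D) - log_L, delta).sum()
    # Small multistart grid (illustrative; the original used a larger grid)
    for a0 in (0.0, 5.0):
        for alpha0 in (0.2, 0.5):
            for e0 in (-1.0, 1.0):
                res = minimize(objective, [a0, a0, e0, alpha0, alpha0],
                               method="L-BFGS-B")
                if best is None or res.fun < best.fun:
                    best = res
    return best

# Synthetic data drawn from a known ground truth (not the reconstructed dataset)
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, 200)   # parameter counts
D = 10 ** rng.uniform(9, 12, 200)   # token counts
true_L = 1.7 + 400.0 / N**0.34 + 1000.0 / D**0.28
log_L = np.log(true_L) + rng.normal(0.0, 1e-4, 200)

res = fit(np.log(N), np.log(D), log_L)
a, b, e, alpha, beta = res.x
print(f"alpha={alpha:.3f}, beta={beta:.3f}")  # true values: 0.34, 0.28
```

With near-noiseless synthetic data the fit recovers the generating exponents; with real extracted data, the replication's point is that the result is sensitive to the Huber delta and to whether the optimizer actually reaches convergence.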
The paper also discusses the implications of these findings, highlighting the importance of robust and reproducible parameter estimates in influential research. The authors conclude that their parameter and standard error estimates make the third approach consistent with the findings from the first two approaches, both in terms of point estimates and standard errors. They emphasize the need for more precise and accurate parameter estimates to establish the relevant relationship in compute-optimal scaling.