Chinchilla Scaling: A replication attempt

May 16, 2024 | Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You
This paper attempts to replicate the third method proposed by Hoffmann et al. (2022) for estimating the compute-optimal scaling law. The method involves fitting a parametric loss function to data extracted from their plots. However, the authors find that Hoffmann et al.'s reported estimates are inconsistent with their first two methods, fail to fit the extracted data, and come with implausibly narrow confidence intervals. Two factors explain these findings: first, the optimizer used by Hoffmann et al. stopped before convergence due to a poor choice of loss scale, and second, the parameter estimates reported in the paper (as opposed to the TeX source) are rounded in a way that introduces substantial bias into the scaling law's predictions. In contrast, the authors' re-derivation of the scaling law using the third approach yields results compatible with the findings from the first two estimation procedures.

The authors partially reconstruct the dataset from Hoffmann et al. and attempt to replicate Approach 3. This involves fitting a parametric function that models the final pre-training loss as $ L(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}} $, where N represents the number of model parameters and D represents the number of training tokens. Their analysis reveals that their estimated model differs substantially from the fit reported by Hoffmann et al., and that Hoffmann et al.'s reported fit fails to adequately describe the reconstructed data. They also show that Hoffmann et al.'s reported fit implies a scaling policy inconsistent with the policies derived through Approaches 1 and 2. Finally, they demonstrate that the confidence intervals reported by Hoffmann et al. are implausibly tight and unlikely to have been obtained from proper statistical procedures given the size of the dataset: achieving such precision would require over 600,000 experiments, while Hoffmann et al. likely ran fewer than 500.
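The fitting procedure behind Approach 3 can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it fits the parametric form above by minimizing a Huber loss on log-space residuals (the objective Hoffmann et al. describe for Approach 3) with L-BFGS, on synthetic noiseless data generated from hypothetical "true" parameters. The grid sizes, seed, and starting perturbation are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, huber

# Hypothetical "true" parameters, loosely Chinchilla-like (for illustration only).
TRUE = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)

def predict_log_loss(theta, N, D):
    """log L(N, D) for L = E + A/N^alpha + B/D^beta, computed stably.

    theta = (a, b, e, alpha, beta) with A = exp(a), B = exp(b), E = exp(e),
    so the three loss terms can be combined with a logsumexp.
    """
    a, b, e, alpha, beta = theta
    terms = np.stack([a - alpha * np.log(N),
                      b - beta * np.log(D),
                      np.full_like(np.log(N), e)])
    return logsumexp(terms, axis=0)

def objective(theta, N, D, log_loss, delta=1e-3):
    # Huber loss on log-space residuals, summed over all training runs.
    r = predict_log_loss(theta, N, D) - log_loss
    return huber(delta, r).sum()

# Synthetic grid of (parameters, tokens) runs standing in for the extracted data.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, size=200)   # 10M .. 10B parameters
D = 10 ** rng.uniform(9, 12, size=200)   # 1B .. 1T tokens
L = TRUE["E"] + TRUE["A"] / N**TRUE["alpha"] + TRUE["B"] / D**TRUE["beta"]
log_L = np.log(L)

# Start near (but not at) the truth and optimize.
theta0 = np.array([np.log(TRUE["A"]), np.log(TRUE["B"]), np.log(TRUE["E"]),
                   TRUE["alpha"], TRUE["beta"]]) + 0.05
res = minimize(objective, theta0, args=(N, D, log_L), method="L-BFGS-B")
```

The loss-scale pitfall the authors identify lives in exactly this kind of setup: with a small Huber delta, a poorly scaled objective can make the optimizer report convergence while residuals are still large.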
The poor fit is due to the optimizer stopping early, which left the parameter values sub-optimal. The authors' own parameter estimates are more accurate and are consistent with the scaling policies from Approaches 1 and 2: their fitted model implies an optimal ratio of around 20 tokens per parameter, which matches both how the Chinchilla model was trained and the findings from Approaches 1 and 2 in Hoffmann et al. The inconsistency between the prescriptions of Hoffmann et al.'s estimated scaling law and the results from Approaches 1 and 2 is thus an artifact of the early-stopping problem that produced the inaccurate parameter estimates. The only data points for which the functional form performs badly in the authors' fit are models trained on extremely few tokens relative to their number of parameters; these are the five outliers excluded from the fit that produced the main parameter estimates.
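To see how a fitted (E, A, B, α, β) translates into a tokens-per-parameter prescription, one can minimize L(N, D) subject to the standard compute approximation C ≈ 6ND; setting the derivative to zero gives a closed form for the optimal N. The parameter values below are hypothetical, chosen with α = β (roughly the regime the authors' estimates fall in), which makes the optimal D/N ratio independent of compute and close to 20.

```python
import numpy as np

# Hypothetical fitted parameters (illustrative only), with alpha = beta.
E, A, B = 1.8, 480.0, 1370.0
alpha = beta = 0.35

def optimal_allocation(C):
    """Compute-optimal (N, D) minimizing L(N, D) subject to C = 6*N*D.

    From d/dN [A*N**-alpha + B*(6*N/C)**beta] = 0:
      N_opt = (alpha*A / (beta*B))**(1/(alpha+beta)) * (C/6)**(beta/(alpha+beta))
    """
    N = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) \
        * (C / 6) ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

for C in [1e21, 1e23, 1e25]:
    N, D = optimal_allocation(C)
    print(f"C={C:.0e}: N={N:.3e}, D={D:.3e}, D/N={D/N:.1f}")
```

With α = β the ratio reduces to D/N = (B/A)**(1/alpha), a constant; when α and β differ (as in Hoffmann et al.'s reported estimates), the prescribed ratio drifts with compute instead.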
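The argument that the reported confidence intervals are implausibly narrow rests on a standard fact: the half-width of a confidence interval for an estimated quantity shrinks as 1/sqrt(n), so tightening an interval by a factor k requires roughly k² times as much data. A minimal sketch (the tightening factor of 35 is illustrative, chosen only to reproduce the order of magnitude, not the paper's exact calculation):

```python
import numpy as np

def ci_half_width(n, sigma=1.0, z=1.96):
    """Half-width of a normal-theory 95% CI for a mean of n observations."""
    return z * sigma / np.sqrt(n)

# Tightening a CI by a factor k requires k**2 times as many observations.
n_runs = 500   # roughly the number of training runs Hoffmann et al. had
k = 35         # illustrative tightening factor
n_needed = n_runs * k**2

print(n_needed)                                         # 612500 -- over 600,000
print(ci_half_width(n_needed) / ci_half_width(n_runs))  # exactly 1/35
```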