Scaling Laws in Linear Regression: Compute, Parameters, and Data

June 13, 2024 | Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee
This paper studies scaling laws in linear regression, focusing on how the test error of a model depends on its model size (M) and data size (N). The authors analyze the test error in an infinite-dimensional linear regression setup where the model is trained by one-pass stochastic gradient descent (SGD) on sketched data, so that M is the sketch dimension and N is the number of samples, each used once. Assuming that the data covariance matrix has a power-law spectrum of degree a > 1 and that the optimal parameters follow a Gaussian prior, they show that the reducible part of the test error is Θ(M^{-(a-1)} + N^{-(a-1)/a}). The variance error, which increases with M, is dominated by the other error terms because of the implicit regularization of SGD, and therefore disappears from the bound. The theory is consistent with empirical neural scaling laws and is verified by numerical simulations.

The paper then addresses the discrepancy between empirical neural scaling laws and standard statistical learning theory: the former suggest that the population risk keeps decreasing as M and N grow, whereas the latter predicts a variance error that increases with M. The authors explain this by showing that the variance error is of higher order and therefore unobservable when the risk is fitted as a function of M and N. They further generalize their results to other settings, including constant-stepsize SGD with iterate averaging, anisotropic priors, and logarithmic power laws. Empirical evidence supports the theory and indicates that the clean neural scaling law observed in practice stems from the disappearance of the variance error under strong regularization.
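One way to read this explanation is through a schematic decomposition of the risk (the grouping below is illustrative, not the paper's exact statement): the reducible error splits into an approximation term from sketching to M dimensions, a bias term from running SGD for only N steps, and a noise-driven variance term; the first two set the observable rates, while the variance term is of strictly higher order.

    reducible error = Approx(M) + Bias(N) + Var(M, N),
    with Approx(M) = Θ(M^{-(a-1)}), Bias(N) = Θ(N^{-(a-1)/a}), and Var(M, N) of higher order, hence invisible when fitting the risk as a function of M and N.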
The paper also discusses the implications of multiple passes over the data, which can increase the variance error and thus disrupt the clean scaling law. The authors conclude that the empirical neural scaling law is a simplification of the statistical learning bound in a regime with strong regularization.
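To make the setup concrete, below is a minimal simulation sketch (not the authors' code) of the one-pass regime described above: a covariance with power-law spectrum truncated at a finite dimension, an isotropic Gaussian ground truth, a Gaussian sketch to M dimensions, and constant-stepsize SGD with tail averaging. The truncation dimension, stepsize, noise level, and the particular sketch are illustrative assumptions; the printed numbers are only meant to show the rough trend of the reducible error in M and N.

```python
import numpy as np

def reducible_error(M, N, d=2000, a=2.0, sigma=0.1, gamma=0.1, seed=0):
    """Reducible test error of one-pass SGD on sketched linear regression.

    Illustrative setup: covariance spectrum lambda_i = i^{-a} (truncated at d),
    ground truth beta ~ N(0, I_d), Gaussian sketch S in R^{M x d}, constant
    stepsize with tail averaging over the second half of the single pass.
    """
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1, dtype=float) ** (-a)    # power-law spectrum of H
    sqrt_lam = np.sqrt(lam)
    beta = rng.standard_normal(d)                     # isotropic Gaussian prior
    S = rng.standard_normal((M, d)) / np.sqrt(M)      # random Gaussian sketch

    w = np.zeros(M)
    w_sum = np.zeros(M)
    n_avg = 0
    for t in range(N):
        x = sqrt_lam * rng.standard_normal(d)         # fresh sample x ~ N(0, H)
        y = beta @ x + sigma * rng.standard_normal()  # noisy label
        z = S @ x                                     # sketched features in R^M
        w -= gamma * (w @ z - y) * z                  # one SGD step, sample used once
        if t >= N // 2:                               # tail-average the iterates
            w_sum += w
            n_avg += 1
    w_bar = w_sum / n_avg

    diff = S.T @ w_bar - beta                         # error in the ambient space
    return float(np.sum(lam * diff**2))               # (S^T w - beta)^T H (S^T w - beta)

if __name__ == "__main__":
    # Increasing M and N should both shrink the reducible error, roughly like
    # M^{-(a-1)} + N^{-(a-1)/a} up to constants.
    for N in (2_000, 20_000):
        errs = [reducible_error(M, N) for M in (8, 32, 128)]
        print(N, [f"{e:.4f}" for e in errs])
```

Replacing the fresh sample drawn at each step with repeated passes over a fixed dataset is precisely where, per the discussion above, the variance error can re-emerge and spoil the clean law.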