June 13, 2024 | Licong Lin*, Jingfeng Wu†, Sham M. Kakade‡, Peter L. Bartlett§, Jason D. Lee†
The paper investigates scaling laws in linear regression, focusing on the relationship between model size ($M$), data size ($N$), and test error. The authors derive theoretical bounds on the population risk of a linear regression model with $M$ parameters trained by one-pass stochastic gradient descent (SGD) on $N$ data points, assuming that the optimal parameter follows a Gaussian prior and that the data covariance matrix has a power-law spectrum of degree $a > 1$ (a small simulation sketch of this setup appears after the list of findings). The key findings are:
1. **Error Decomposition**: The population risk decomposes into an irreducible risk, an approximation error, a bias error, and a variance error. The irreducible risk is the noise floor and depends on neither $M$ nor $N$; the approximation error decreases as $M$ increases, and the bias error decreases as $N$ increases.
2. **Negligible Variance Error**: The variance error, which increases with $M$, is of lower order than the approximation and bias errors, thanks to the implicit regularization of one-pass SGD, so it does not affect the leading order of the population risk bound.
3. **Theoretical Bounds**: The authors derive matching upper and lower bounds on the population risk, showing that the expected excess risk is dominated by the sum of the approximation and bias errors.
4. **Optimal Stepsize and Data Allocation**: The optimal SGD stepsize is of order one when the effective sample size $N_{\text{eff}}$ is small, and can be chosen from a wider range when $N_{\text{eff}}$ is large. The paper also discusses how to allocate model size and data size under a given compute budget (a back-of-the-envelope numerical check of this tradeoff is given at the end of this summary).
5. **Empirical Evidence**: The theoretical results are supported by empirical evidence; in particular, large language models often underfit their training data, which keeps the variance error small and is consistent with the observed neural scaling laws.
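
To make the setup concrete, the following minimal sketch simulates the data model and one-pass SGD described above. It is not the authors' code; the finite ambient dimension `D`, the stepsize `gamma`, the noise level `sigma`, and the function name `excess_risk_sgd` are illustrative assumptions. Because the covariance is diagonal, the excess population risk can be evaluated in closed form, which separates the approximation error (the spectrum's tail beyond $M$) from the error of SGD on the first $M$ coordinates.

```python
# Minimal simulation sketch of the setup above (not the authors' code).
# Assumptions (illustrative, not from the paper): a finite ambient dimension D
# truncates the infinite-dimensional problem, the covariance eigenvalues are
# lambda_i = i^{-a}, the optimal parameter has a standard Gaussian prior,
# features are Gaussian, and the M-parameter model sees only the first M coordinates.
import numpy as np

rng = np.random.default_rng(0)

def excess_risk_sgd(M, N, a=2.0, D=2000, gamma=0.2, sigma=0.1):
    """One pass of SGD on an M-parameter linear model; returns the excess population risk."""
    lam = np.arange(1, D + 1, dtype=float) ** (-a)   # power-law spectrum of degree a
    w_star = rng.standard_normal(D)                  # Gaussian prior on the optimal parameter
    w = np.zeros(M)                                  # the model only uses the first M features
    for _ in range(N):                               # one pass: each sample is used exactly once
        x = rng.standard_normal(D) * np.sqrt(lam)    # feature with covariance diag(lam)
        y = x @ w_star + sigma * rng.standard_normal()
        w += gamma * (y - x[:M] @ w) * x[:M]         # SGD step on the squared loss
    # The covariance is diagonal, so the excess population risk has a closed form:
    # sum_{i<=M} lam_i (w_i - w*_i)^2 (bias/variance part) plus the approximation
    # error sum_{i>M} lam_i (w*_i)^2 from the unmodeled tail.
    return lam[:M] @ (w - w_star[:M]) ** 2 + lam[M:] @ w_star[M:] ** 2

for M, N in [(50, 1000), (200, 1000), (200, 10000)]:
    print(f"M={M:4d}  N={N:6d}  excess risk ~ {excess_risk_sgd(M, N):.4f}")
```

Averaged over a few seeds, the printed excess risk should fall as either $M$ or $N$ grows, reflecting the approximation and bias terms in findings 1-3, while the noise floor $\sigma^2$ is excluded by construction.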
The paper provides a theoretical foundation for understanding neural scaling laws and highlights the role of the implicit regularization of SGD.
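
As a complement to point 4 above, here is a hedged back-of-the-envelope check of the compute-allocation tradeoff. It assumes, as a simplification rather than the paper's exact statement, that the excess risk scales like $M^{-(a-1)} + N^{-(a-1)/a}$ and that compute is proportional to $MN$, ignoring the stepsize and any logarithmic factors. Balancing the two terms suggests $M \propto C^{1/(a+1)}$, $N \propto C^{a/(a+1)}$, and an optimal risk of order $C^{-(a-1)/(a+1)}$; the snippet below verifies these exponents numerically for this simplified proxy.

```python
# Hedged numerical check of the compute-optimal allocation under a simplified
# risk proxy (an assumption for illustration, not the paper's exact bound):
# excess risk ~ M**(-(a-1)) + N**(-(a-1)/a) with compute budget C = M * N.
import numpy as np

a = 2.0                                     # power-law degree of the covariance spectrum
Cs = np.logspace(6, 12, 7)                  # compute budgets C = M * N
best_risk, best_M = [], []
for C in Cs:
    M = np.logspace(0, np.log10(C), 4000)                 # candidate model sizes
    risk = M ** (-(a - 1)) + (C / M) ** (-(a - 1) / a)     # simplified risk proxy
    i = risk.argmin()
    best_risk.append(risk[i])
    best_M.append(M[i])

# Log-log fits of the optimal risk and model size against compute,
# compared with the exponents predicted by balancing the two terms.
risk_slope = np.polyfit(np.log(Cs), np.log(best_risk), 1)[0]
M_slope = np.polyfit(np.log(Cs), np.log(best_M), 1)[0]
print(f"risk ~ C^{risk_slope:.3f} (predicted {-(a - 1) / (a + 1):.3f}), "
      f"optimal M ~ C^{M_slope:.3f} (predicted {1 / (a + 1):.3f})")
```

For $a = 2$ the fitted exponents come out near $-1/3$ and $1/3$, matching the balanced-terms prediction; under this simplification, a growing compute budget is best spent increasing data faster than parameters.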