Scaling and renormalization in high-dimensional regression


June 27, 2024 | Alexander Atanasov, Jacob A. Zavatone-Veth, Cengiz Pehlevan
This paper presents a derivation of the training and generalization performance of high-dimensional ridge regression models using random matrix theory and free probability, and reviews recent results in these areas for readers with backgrounds in physics and deep learning. Analytic formulas for training and generalization errors are derived using the S-transform of free probability, enabling the identification of power-law scaling regimes in model performance. The generalization error of random feature models is computed, showing that the S-transform governs the train-test generalization gap and yields a generalized cross-validation estimator. The paper derives fine-grained bias-variance decompositions for random feature models with structured covariates, revealing a scaling regime in which variance over the random features limits performance in overparameterized settings. It also demonstrates how anisotropic weight structure can limit performance and lead to nontrivial exponents for finite-width corrections. The results unify and extend earlier models of neural scaling laws.
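As a concrete instance of the train-test relation mentioned above, the generalized cross-validation estimator for ridge regression with ridge \(\lambda\), \(n\) samples, design matrix \(X\), and sample covariance \(\hat{\Sigma} = \frac{1}{n} X^\top X\) can be written in a standard form (stated here for orientation rather than as the paper's exact result):

\[ E_{\mathrm{test}} \;\approx\; \mathrm{GCV}(\lambda) \;=\; \frac{E_{\mathrm{train}}}{\left(1 - \frac{1}{n}\,\mathrm{tr}\!\left[\hat{\Sigma}\,(\hat{\Sigma} + \lambda I)^{-1}\right]\right)^{2}}. \]

The multiplicative factor separating training and test error is thus controlled by a single resolvent trace, which is the kind of quantity the S-transform encodes in the paper's framework.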
The paper is organized around three key principles: Gaussian universality, deterministic equivalence, and the S-transform. Gaussian universality states that, in high dimensions, the sample covariance matrices arising in linear regression behave as if the covariates were Gaussian with matched second moments. Deterministic equivalence allows the sample covariance to be replaced by the population covariance in algebraic expressions, at the cost of renormalizing the ridge parameter. The S-transform characterizes the spectra of products of free random matrices and supplies this renormalization. These principles are applied to derive scaling laws for linear and kernel ridge regression, showing that the S-transform controls the train-test generalization gap and gives a simple interpretation of the self-consistent equations for the generalization error. The same machinery is then applied to random feature models, yielding generalization error formulas and fine-grained bias-variance decompositions, and revealing a variance-dominated scaling regime both with and without feature noise. The analysis extends to models with additive feature noise, showing that nonlinearity can be treated as additive noise on the features. Together, the results provide a unified perspective on neural scaling laws and highlight the role of random matrix theory in understanding model performance in high-dimensional settings.
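To make deterministic equivalence and the renormalized ridge concrete, a standard form of the statement (written in generic random-matrix notation, which may differ from the paper's exact conventions) is \(\lambda\,(\hat{\Sigma} + \lambda I)^{-1} \simeq \kappa\,(\Sigma + \kappa I)^{-1}\), where \(\Sigma\) is the population covariance and the renormalized ridge \(\kappa\) solves the self-consistent equation \(\kappa = \lambda + \frac{\kappa}{n}\,\mathrm{tr}\!\left[\Sigma\,(\Sigma + \kappa I)^{-1}\right]\). The short script below (illustrative code written for this summary, not taken from the paper; names such as kappa and Sigma_hat are our own) solves this equation numerically and compares the resulting prediction against the empirical resolvent trace for Gaussian covariates with a power-law covariance spectrum.

```python
# Minimal numerical check of the deterministic equivalence sketched above.
# Illustrative code written for this summary (not from the paper); names such
# as `kappa`, `Sigma_hat`, and the power-law spectrum are our own choices.
import numpy as np

rng = np.random.default_rng(0)

n, p = 2000, 1000        # number of samples, number of features
lam = 1e-2               # bare ridge parameter lambda

# Population covariance: diagonal with a power-law spectrum (a simple
# "structured covariates" choice).
eigs = (1.0 + np.arange(p)) ** -1.5

# Gaussian covariates x_i ~ N(0, diag(eigs)) and the sample covariance.
X = rng.standard_normal((n, p)) * np.sqrt(eigs)
Sigma_hat = X.T @ X / n

# Solve the self-consistent equation
#   kappa = lam + (kappa / n) * tr[ Sigma (Sigma + kappa I)^{-1} ]
# for the renormalized ridge by fixed-point iteration.
kappa = lam
for _ in range(10_000):
    kappa_new = lam + (kappa / n) * np.sum(eigs / (eigs + kappa))
    if abs(kappa_new - kappa) < 1e-14:
        break
    kappa = kappa_new

# Compare the empirical resolvent trace with its deterministic equivalent:
#   (1/p) tr[ lam (Sigma_hat + lam I)^{-1} ]  vs  (1/p) tr[ kappa (Sigma + kappa I)^{-1} ]
lhs = lam * np.trace(np.linalg.inv(Sigma_hat + lam * np.eye(p))) / p
rhs = kappa * np.sum(1.0 / (eigs + kappa)) / p
print(f"empirical : {lhs:.5f}")
print(f"predicted : {rhs:.5f}")
```

For problem sizes like these the two printed values should agree closely; this trace-level agreement is the practical content of replacing the sample covariance by the population covariance at a renormalized ridge.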