4+3 Phases of Compute-Optimal Neural Scaling Laws

23 May 2024 | Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
This paper presents a detailed analysis of compute-optimal neural scaling laws through a three-parameter model, power-law random features (PLRF), whose parameters are the data complexity (α), the target complexity (β), and the model-parameter count (d). By analyzing the training dynamics of one-pass stochastic gradient descent (SGD), the authors derive new predictions for the compute-limited, infinite-data scaling-law regime.

The analysis identifies four phases (plus three subphases) in the data-complexity/target-complexity phase plane, with phase boundaries determined by the relative importance of model capacity, optimizer noise, and the feature embedding. The scaling-law exponents are derived in all of these phases, in particular the optimal model-parameter count as a function of the floating-point-operation budget f. For a large portion of the (α, β) phase plane the optimum scales as d* ∝ f^{1/2}, suggesting a regime of universal scaling behavior. Outside this regime the PLRF model exhibits rich compute-optimal and loss-curve behavior that differs qualitatively and quantitatively with the relative strengths of the data complexity (α) versus the target complexity (β), giving the four distinct (plus three subphase) curve behaviors.

The main technical tool is a Volterra equation for the expected risk under SGD, which governs the learning dynamics. The paper analyzes this equation under a deterministic equivalent, including the convergence threshold for the stepsize and a simplification of the Volterra equation from which the scaling exponents are extracted.
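To make the setup concrete, the following is a minimal sketch of one plausible instantiation of the PLRF model together with one-pass SGD on it. Every specific choice here — the j^{-2α} covariance spectrum, the j^{-β} target coefficients, the N(0, 1/d) Gaussian embedding, the dimensions, and the stepsize — is an illustrative assumption and may differ from the paper's exact normalizations.

```python
import numpy as np

# Hypothetical instantiation of a power-law random features (PLRF) problem
# and one-pass SGD on it; normalizations are assumptions, not the paper's.
rng = np.random.default_rng(0)

v, d = 2000, 100            # ambient (data) dimension and model-parameter count
alpha, beta = 1.2, 0.8      # data complexity and target complexity exponents

j = np.arange(1, v + 1, dtype=float)
sqrt_lam = j ** (-alpha)    # data covariance D = diag(j^{-2*alpha})
b = j ** (-beta)            # power-law target coefficients
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))  # random feature embedding

def sample_x():
    """Draw a fresh sample x with E[x x^T] = D (power-law spectrum)."""
    return sqrt_lam * rng.normal(size=v)

def population_risk(theta):
    """P(theta) = (1/2) * E_x[(x^T (W theta - b))^2], computed in closed form."""
    r = W @ theta - b
    return 0.5 * np.sum(sqrt_lam ** 2 * r ** 2)

# One-pass SGD: each step uses a fresh sample, so there is no finite training
# set; total compute grows with (number of steps) x (model size d).
gamma = 0.05                # illustrative stepsize, well below the stability threshold
theta = np.zeros(d)
for t in range(20001):
    x = sample_x()
    err = x @ (W @ theta - b)            # scalar prediction error on this sample
    theta -= gamma * err * (W.T @ x)     # SGD step on (1/2) * err^2
    if t % 5000 == 0:
        print(f"step {t:6d}   population risk {population_risk(theta):.6f}")
```

Plotting the population risk against cumulative flops for several values of d is what traces out the compute-optimal frontier the paper studies.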
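The Volterra-equation machinery can be summarized schematically. In this line of work, the expected risk 𝒫(t) of one-pass SGD solves a convolution-type Volterra equation whose forcing function F and kernel K are built from the spectrum of the embedded covariance (written Ĥ below). The display is a hedged schematic of that structure, not a reproduction of the paper's exact formulas.

```latex
% Schematic convolution-type Volterra equation for the expected risk under
% one-pass SGD; the exact forcing function and kernel in the paper are
% expressed through the PLRF spectrum and may differ from this sketch.
\[
  \mathscr{P}(t) \;=\; F(t) \;+\; \gamma^{2}\!\int_{0}^{t} K(t-s)\,\mathscr{P}(s)\,\mathrm{d}s,
  \qquad
  K(t) \;\approx\; \operatorname{tr}\!\bigl(\hat{H}^{2} e^{-2\gamma \hat{H} t}\bigr),
\]
% with \hat{H} the covariance of the embedded features and F(t) the noiseless
% (gradient-flow) part of the dynamics.  The kernel norm controls convergence:
\[
  \gamma^{2}\!\int_{0}^{\infty} K(s)\,\mathrm{d}s
  \;\approx\; \frac{\gamma}{2}\,\operatorname{tr}\hat{H} \;<\; 1
  \quad\Longleftrightarrow\quad
  \gamma \;<\; \frac{2}{\operatorname{tr}\hat{H}},
\]
% which is the type of convergence threshold referred to above.
```

Replacing the Ĥ-dependent quantities by their deterministic equivalents is what turns this into a tractable equation whose long-time behavior yields the phase diagram.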
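Finally, the headline allocation rule for the universal region can be illustrated with a toy budget split. Assuming each one-pass SGD step costs a number of flops proportional to the parameter count d (the flop accounting typically used for this kind of model), choosing d* ∝ f^{1/2} spends the budget evenly between model size and number of SGD steps; the constants below are arbitrary.

```python
# Toy illustration of the reported allocation rule d* ~ f^{1/2} in the
# "universal" region of the (alpha, beta) phase plane.  Per-step cost
# proportional to d is an assumption of this sketch.
def allocate(flops: float) -> tuple[float, float]:
    d_star = flops ** 0.5        # compute-optimal parameter count (up to constants)
    steps = flops / d_star       # remaining budget buys one-pass SGD steps
    return d_star, steps

for f in (1e12, 1e15, 1e18):
    d_star, steps = allocate(f)
    print(f"f = {f:.0e}:  d* ~ {d_star:.2e},  SGD steps ~ {steps:.2e}")
```

Outside this region the exponent relating d* to f depends on α and β, which is exactly what the four phases (and three subphases) of the paper's phase diagram track.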
The paper concludes with a summary of the main contributions and findings, highlighting the importance of understanding the compute-optimal scaling laws in the context of large language models and other large-scale optimization problems.