4+3 Phases of Compute-Optimal Neural Scaling Laws

23 May 2024 | Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
This paper presents a detailed analysis of compute-optimal neural scaling laws through a three-parameter model, power-law random features (PLRF), whose parameters are the data complexity (α), the target complexity (β), and the model-parameter count (d). By analyzing the training dynamics of one-pass stochastic gradient descent (SGD), the authors derive new predictions for the compute-limited, infinite-data scaling-law regime.

The analysis identifies four phases (plus three subphases) in the data-complexity/target-complexity phase plane, with phase boundaries determined by the relative importance of model capacity, optimizer noise, and the feature embedding. The scaling-law exponents are derived in all of these phases, in particular the optimal model-parameter count as a function of the floating-point-operation budget f. For a large portion of the (α, β) phase plane the optimum scales as d* ∝ f^{1/2}, suggesting a regime of universal scaling behavior. Outside this regime the PLRF model exhibits rich compute-optimal and loss-curve behavior that differs qualitatively and quantitatively with the relative strengths of the data complexity (α) versus the target complexity (β), giving the four distinct (plus three subphase) curve behaviors.

The main technical tool is a Volterra equation for the expected risk under SGD, which governs the learning dynamics. The paper analyzes this equation under a deterministic equivalent, including the convergence threshold for the stepsize and a simplification of the Volterra equation from which the scaling exponents are extracted.
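To make the setup concrete, the following is a minimal sketch of one plausible instantiation of the PLRF model together with one-pass SGD on it. Every specific choice here — the j^{-2α} covariance spectrum, the j^{-β} target coefficients, the N(0, 1/d) Gaussian embedding, the dimensions, and the stepsize — is an illustrative assumption and may differ from the paper's exact normalizations.

```python
import numpy as np

# Hypothetical instantiation of a power-law random features (PLRF) problem
# and one-pass SGD on it; normalizations are assumptions, not the paper's.
rng = np.random.default_rng(0)

v, d = 2000, 100            # ambient (data) dimension and model-parameter count
alpha, beta = 1.2, 0.8      # data complexity and target complexity exponents

j = np.arange(1, v + 1, dtype=float)
sqrt_lam = j ** (-alpha)    # data covariance D = diag(j^{-2*alpha})
b = j ** (-beta)            # power-law target coefficients
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))  # random feature embedding

def sample_x():
    """Draw a fresh sample x with E[x x^T] = D (power-law spectrum)."""
    return sqrt_lam * rng.normal(size=v)

def population_risk(theta):
    """P(theta) = (1/2) * E_x[(x^T (W theta - b))^2], computed in closed form."""
    r = W @ theta - b
    return 0.5 * np.sum(sqrt_lam ** 2 * r ** 2)

# One-pass SGD: each step uses a fresh sample, so there is no finite training
# set; total compute grows with (number of steps) x (model size d).
gamma = 0.05                # illustrative stepsize, well below the stability threshold
theta = np.zeros(d)
for t in range(20001):
    x = sample_x()
    err = x @ (W @ theta - b)            # scalar prediction error on this sample
    theta -= gamma * err * (W.T @ x)     # SGD step on (1/2) * err^2
    if t % 5000 == 0:
        print(f"step {t:6d}   population risk {population_risk(theta):.6f}")
```

Plotting the population risk against cumulative flops for several values of d is what traces out the compute-optimal frontier the paper studies.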
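The Volterra-equation machinery can be summarized schematically. In this line of work, the expected risk 𝒫(t) of one-pass SGD solves a convolution-type Volterra equation whose forcing function F and kernel K are built from the spectrum of the embedded covariance (written Ĥ below). The display is a hedged schematic of that structure, not a reproduction of the paper's exact formulas.

```latex
% Schematic convolution-type Volterra equation for the expected risk under
% one-pass SGD; the exact forcing function and kernel in the paper are
% expressed through the PLRF spectrum and may differ from this sketch.
\[
  \mathscr{P}(t) \;=\; F(t) \;+\; \gamma^{2}\!\int_{0}^{t} K(t-s)\,\mathscr{P}(s)\,\mathrm{d}s,
  \qquad
  K(t) \;\approx\; \operatorname{tr}\!\bigl(\hat{H}^{2} e^{-2\gamma \hat{H} t}\bigr),
\]
% with \hat{H} the covariance of the embedded features and F(t) the noiseless
% (gradient-flow) part of the dynamics.  The kernel norm controls convergence:
\[
  \gamma^{2}\!\int_{0}^{\infty} K(s)\,\mathrm{d}s
  \;\approx\; \frac{\gamma}{2}\,\operatorname{tr}\hat{H} \;<\; 1
  \quad\Longleftrightarrow\quad
  \gamma \;<\; \frac{2}{\operatorname{tr}\hat{H}},
\]
% which is the type of convergence threshold referred to above.
```

Replacing the Ĥ-dependent quantities by their deterministic equivalents is what turns this into a tractable equation whose long-time behavior yields the phase diagram.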
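Finally, the headline allocation rule for the universal region can be illustrated with a toy budget split. Assuming each one-pass SGD step costs a number of flops proportional to the parameter count d (the flop accounting typically used for this kind of model), choosing d* ∝ f^{1/2} spends the budget evenly between model size and number of SGD steps; the constants below are arbitrary.

```python
# Toy illustration of the reported allocation rule d* ~ f^{1/2} in the
# "universal" region of the (alpha, beta) phase plane.  Per-step cost
# proportional to d is an assumption of this sketch.
def allocate(flops: float) -> tuple[float, float]:
    d_star = flops ** 0.5        # compute-optimal parameter count (up to constants)
    steps = flops / d_star       # remaining budget buys one-pass SGD steps
    return d_star, steps

for f in (1e12, 1e15, 1e18):
    d_star, steps = allocate(f)
    print(f"f = {f:.0e}:  d* ~ {d_star:.2e},  SGD steps ~ {steps:.2e}")
```

Outside this region the exponent relating d* to f depends on α and β, which is exactly what the four phases (and three subphases) of the paper's phase diagram track.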
The paper concludes with a summary of the main contributions and findings, highlighting the importance of understanding the compute-optimal scaling laws in the context of large language models and other large-scale optimization problems.