2024 | Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
This paper presents a dynamical model of neural scaling laws, analyzing how the performance of neural networks scales with training time, model size, and dataset size. The model, a random feature model trained with gradient descent, reproduces key empirical observations about neural scaling laws. It predicts that performance scales with training time and with model size under different power-law exponents, leading to an asymmetric compute-optimal scaling rule in which training steps increase faster than model parameters. It also explains how the gap between training and test loss builds up gradually over time due to repeated reuse of the data. The model shows that early in training, networks converge to their infinite-width dynamics at a rate of 1/width, but at late times the rate becomes width^{-c}, where c depends on the architecture and the task. It also demonstrates that larger models can train faster and that ensembling is not always compute-optimal. The paper provides a theoretical framework for understanding the dynamics of training and generalization in neural networks, showing how these dynamics depend on the structure of the data and the architecture of the network. The predictions are validated against empirical observations on image and language tasks, and the paper discusses the implications of these findings for the design of neural networks and the optimization of training.
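To make the setup concrete, below is a minimal sketch (not the paper's code) of a random feature model with a power-law feature spectrum trained by gradient descent. The dimension, spectrum exponent, target structure, and learning rate are illustrative assumptions; for simplicity the sketch trains on the population loss, so it illustrates the width and training-time bottlenecks but not the finite-data effects mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and exponents (assumptions, not values from the paper)
D = 2000            # ambient feature dimension
alpha = 1.5         # power-law decay exponent of the feature spectrum
eigs = np.arange(1, D + 1, dtype=float) ** (-alpha)
w_star = rng.normal(size=D) * np.sqrt(eigs)   # target weights aligned with the spectrum

def test_loss_curve(N, steps, lr=0.2):
    """Train a width-N random feature readout with full-batch gradient descent
    on the population loss and record the test loss at every step."""
    A = rng.normal(size=(N, D)) / np.sqrt(D)  # random projection defining the N features
    Sigma = (A * eigs) @ A.T                  # N x N population covariance of the features
    b = A @ (eigs * w_star)                   # feature-target correlation
    const = w_star @ (eigs * w_star)          # target power (sets the loss at initialization)
    v = np.zeros(N)                           # trainable readout weights
    losses = []
    for _ in range(steps):
        v -= lr * (Sigma @ v - b)             # gradient step on 0.5 * E[(f(x) - y)^2]
        losses.append(0.5 * (v @ Sigma @ v - 2 * v @ b + const))
    return np.array(losses)

# Wider models should track the shared early-time (infinite-width) trajectory
# longer before plateauing at a lower, width-limited test loss.
for N in (64, 256, 1024):
    curve = test_loss_curve(N, steps=2000)
    print(f"N={N:5d}  final test loss ~ {curve[-1]:.4e}")
```

Varying N in this sketch reproduces the qualitative picture described in the summary: wider models follow a common early-time trajectory for longer before bottoming out at a lower, width-limited loss, which is what drives the asymmetric trade-off between parameters and training steps at fixed compute.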