23 Jun 2024 | Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
This paper presents a dynamical model to analyze neural scaling laws, which describe how the performance of neural networks improves with training time, dataset size, and model size. Based on a random feature model trained with gradient descent, it captures several key phenomena observed in deep learning. It predicts different power-law exponents for the scaling of performance with training time and model size, leading to an asymmetric compute-optimal scaling rule in which the number of training steps increases faster than the number of model parameters. The model also explains why larger models train faster and how the gap between training and test loss can gradually widen over time due to repeated data reuse. The theory is validated through simulations and applied to realistic datasets, showing close agreement with observed scaling laws. The paper concludes by discussing the implications of these findings for understanding and optimizing deep learning systems.
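To make the setup concrete, below is a minimal, illustrative sketch of a random feature model trained with gradient descent on a regression task with a power-law feature spectrum. The dimensions, learning rate, spectrum exponent, and checkpoint steps are arbitrary choices for illustration; this is not the paper's exact model and will not reproduce its exponents, but it shows qualitatively how test loss improves with both model size and training time while a train/test gap opens up from reusing a finite dataset.

```python
# Illustrative sketch (not the paper's exact model): a frozen random projection
# of power-law-structured inputs, with only the linear readout trained by
# full-batch gradient descent. We track train/test loss for several model sizes N.
import numpy as np

rng = np.random.default_rng(0)

D = 512        # ambient input dimension
P = 256        # training samples (finite data -> train/test gap)
P_test = 2048  # held-out samples for estimating the test loss
steps = 2000
lr = 0.05

# Power-law spectrum so the task has nontrivial structure across directions.
spectrum = np.arange(1, D + 1, dtype=float) ** -1.0
w_star = rng.normal(size=D) * np.sqrt(spectrum)  # target weights

def make_data(n):
    X = rng.normal(size=(n, D)) * np.sqrt(spectrum)  # anisotropic inputs
    y = X @ w_star                                   # noiseless linear target
    return X, y

X_tr, y_tr = make_data(P)
X_te, y_te = make_data(P_test)

for N in (64, 256, 1024):                        # model size = number of random features
    W = rng.normal(size=(D, N)) / np.sqrt(D)     # frozen random feature projection
    phi_tr, phi_te = X_tr @ W, X_te @ W
    theta = np.zeros(N)                          # trainable readout weights

    for t in range(1, steps + 1):
        resid = phi_tr @ theta - y_tr
        theta -= lr * phi_tr.T @ resid / P       # gradient step on 0.5 * mean squared error
        if t in (100, 500, steps):
            train = np.mean((phi_tr @ theta - y_tr) ** 2)
            test = np.mean((phi_te @ theta - y_te) ** 2)
            print(f"N={N:5d}  step={t:5d}  train={train:.4f}  test={test:.4f}")
```

In this toy setting, the larger-N runs typically reach a given test loss in fewer steps and plateau at a lower value, while the train and test curves gradually separate as the same fixed training set is reused, mirroring the qualitative behavior the paper analyzes.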