Scaling Exponents Across Parameterizations and Optimizers

2024 | Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington
This paper investigates the scaling exponents of neural network parameterizations and optimizers, focusing on how the choice of parameterization and optimizer affects learning rate scaling and training stability as model width grows. The authors propose a new perspective on parameterization by examining the alignment between parameters and data, and derive new theoretical results under weaker assumptions and for a broader set of optimizers. Their extensive empirical investigation spans tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. They find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Their results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, their novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, they demonstrate that an overlooked aspect of parameterization, the epsilon hyperparameter in Adam, must be scaled correctly to avoid gradient underflow, and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
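To make the epsilon issue concrete, here is a minimal NumPy sketch (not the paper's implementation): a plain Adam-style update, where a fixed epsilon in the denominator can dominate a very small second-moment estimate and suppress the step, contrasted with an epsilon-free atan2-style update in the spirit of Adam-atan2. The function names and constants are illustrative assumptions, not the exact form given in the paper.

```python
import numpy as np

def adam_update(m_hat, v_hat, lr=1e-3, eps=1e-8):
    """Plain Adam-style update direction (bias correction omitted)."""
    return lr * m_hat / (np.sqrt(v_hat) + eps)

def atan2_update(m_hat, v_hat, lr=1e-3):
    """Epsilon-free, atan2-style update in the spirit of Adam-atan2.
    atan2(c*m, c*sqrt(v)) == atan2(m, sqrt(v)) for any c > 0, so the step
    is invariant to rescaling the gradients. Constants are illustrative."""
    return lr * np.arctan2(m_hat, np.sqrt(v_hat))

# Very small gradients, e.g. from a very wide layer: sqrt(v_hat) << eps.
m_hat, v_hat = np.array([1e-12]), np.array([1e-24])

# Epsilon dominates the denominator, shrinking the step by ~4 orders of
# magnitude relative to the epsilon-free ratio m_hat / sqrt(v_hat).
print(adam_update(m_hat, v_hat))   # ~1e-07
print(atan2_update(m_hat, v_hat))  # ~7.9e-04, unaffected by gradient scale
```

Rescaling the first- and second-moment estimates by the same factor leaves the atan2 step unchanged, which is the scale-invariance property described above; the precise constants used in the paper's Adam-atan2 should be taken from the original.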
On the theory side, the paper defines a general space of width-scaling parameterizations, derives stability and nontriviality constraints, and proposes a new metric for alignment. The authors show that the alignment assumptions in prior work may be overly conservative, so the best learning rate scaling prescriptions may not be captured by them. They also demonstrate that the alignment between parameters and data is dynamic and varies throughout training, with important implications for learning rate scaling and stability. The paper concludes with experiments that validate the theoretical findings and show that per-layer learning rate exponents can significantly improve model performance.
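As a rough illustration of what such a per-layer prescription looks like operationally, the sketch below scales a single tuned base learning rate by width^(-c_l), with a different exponent per layer class. The layer names and exponent values are placeholders chosen for illustration, not the prescriptions derived in the paper.

```python
# Hypothetical per-layer scaling: lr_l = base_lr * (width / base_width) ** (-c_l).
# The embedding / hidden / readout split mirrors the usual layer classes;
# the exponent values are placeholders, not the paper's tuned prescriptions.
PER_LAYER_EXPONENTS = {"embedding": 0.0, "hidden": 1.0, "readout": 1.0}

def per_layer_learning_rates(width, base_width=256, base_lr=1e-3,
                             exponents=PER_LAYER_EXPONENTS):
    """Transfer a base learning rate tuned at base_width to a new width."""
    ratio = width / base_width
    return {layer: base_lr * ratio ** (-c) for layer, c in exponents.items()}

# Example: transfer a learning rate tuned at width 256 to width 4096.
print(per_layer_learning_rates(4096))
# {'embedding': 0.001, 'hidden': 6.25e-05, 'readout': 6.25e-05}
```

The point of such a scheme, as the paper argues empirically, is that a single base hyperparameter tuned at small width can transfer across model sizes once the per-layer exponents are chosen correctly.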