Scaling Exponents Across Parameterizations and Optimizers

2024 | Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington
This paper studies how neural network models scale from small to large widths, focusing on the precise adjustment of algorithmic and architectural details such as parameterization and optimizer choices. The authors propose a new perspective on parameterization by investigating the alignment between parameters and data, derive theoretical results under weaker assumptions than prior work, and evaluate these across a broad set of optimizers and parameterizations. Their empirical study covers tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, a range of learning rates, and model sizes up to 26.8 billion parameters.

Key findings include:

1. **Parameterization and alignment**: The best learning rate scaling prescriptions would often have been excluded by prior assumptions about the alignment between parameters and data.
2. **Hyperparameter transfer**: All parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer.
3. **Per-layer learning rates**: A novel per-layer learning rate prescription for standard parameterization outperforms muP (see the first sketch after this list).
4. **Epsilon parameter in Adam**: Adam's epsilon parameter must be scaled correctly to avoid gradient underflow, motivating *Adam-atan2*, a numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter (see the second sketch after this list).
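This summary does not reproduce the paper's tuned per-layer exponents, so the following sketch only illustrates the mechanism behind width-dependent per-layer learning rates: each layer's rate is rescaled by a power of (base_width / width) as the model widens. The layer names and exponent values below are illustrative placeholders, not the paper's prescription.

```python
# Hypothetical sketch of per-layer learning-rate scaling with width.
# The exponents are placeholders: the point is only that each layer's
# learning rate is rescaled by a power of (base_width / width).

def per_layer_lr(base_lr, base_width, width, exponents):
    """Scale a base learning rate per layer as width grows.

    exponents: dict mapping layer name -> scaling exponent c, so that
    lr_layer = base_lr * (base_width / width) ** c.
    """
    return {name: base_lr * (base_width / width) ** c
            for name, c in exponents.items()}

# Example: hyperparameters tuned at width 256, model scaled to width 4096.
# Exponent 0 means no rescaling; exponent 1 means inverse-width scaling.
lrs = per_layer_lr(
    base_lr=1e-3,
    base_width=256,
    width=4096,
    exponents={"embedding": 0.0, "hidden": 1.0, "readout": 1.0},
)
print(lrs)
```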
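The Adam-atan2 update rule is not spelled out in this summary; the sketch below is a minimal illustration of the core idea as described above, replacing Adam's epsilon-stabilized division with an atan2 term that is bounded, scale-invariant, and epsilon-free. Constant factors and other details of the paper's actual Adam-atan2 may differ.

```python
import numpy as np

# Minimal sketch of an atan2-style Adam step, assuming only the core idea:
# the epsilon-stabilized division m / (sqrt(v) + eps) is replaced by
# arctan2(m, sqrt(v)), which is bounded, needs no epsilon, and is invariant
# to rescaling the gradients by a common positive factor.

def adam_atan2_step(param, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    # Bias correction, as in standard Adam.
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)
    # atan2 replaces the division: no epsilon, scale-invariant, bounded update.
    update = np.arctan2(m_hat, np.sqrt(v_hat))
    return param - lr * update, m, v

# Toy usage on a single parameter vector; steps start at 1 for bias correction.
p, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    g = np.array([1e-12, 1e-3, 1.0, -2.0])
    # The tiny first coordinate still yields an O(1) normalized update,
    # instead of being swamped by an epsilon term as in standard Adam.
    p, m, v = adam_atan2_step(p, g, m, v, t)
print(p)
```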
The paper also introduces a metric for alignment, the *alignment ratio*, which measures the contribution of alignment to activation scales during training (see the sketch at the end of this summary). Empirical results show that alignment is dynamic and varies significantly throughout training, with different patterns across layers and parameterizations. The authors conclude that existing theory may be overly conservative, and that their findings provide a more principled approach to understanding and optimizing model scaling.
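The exact formula for the paper's alignment ratio is not given in this summary. As a hypothetical illustration of the kind of quantity involved, the sketch below compares the norm of a weight update applied to incoming activations against the norm expected if the update and the activations were unaligned (random), so a value near 1 indicates no alignment and larger values indicate alignment. This normalization is an assumption for illustration, not necessarily the paper's definition.

```python
import numpy as np

# Illustrative alignment-style measurement (assumed normalization, see above):
# compare ||dW @ x|| to the value expected for an unaligned random update,
# so ratio ~ 1 means "no alignment" and larger values mean the update is
# aligned with the incoming activations.

def alignment_ratio(dW, x):
    """dW: (n_out, n_in) weight update; x: (n_in,) activation vector."""
    n_in = dW.shape[1]
    actual = np.linalg.norm(dW @ x)
    unaligned = np.linalg.norm(dW, ord="fro") * np.linalg.norm(x) / np.sqrt(n_in)
    return actual / unaligned

rng = np.random.default_rng(0)
x = rng.normal(size=512)
dW_random = rng.normal(size=(512, 512))          # unaligned update
dW_aligned = np.outer(rng.normal(size=512), x)   # rank-1 update aligned with x
print(alignment_ratio(dW_random, x))    # ~1
print(alignment_ratio(dW_aligned, x))   # ~sqrt(n_in)
```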