Deconstructing What Makes a Good Optimizer for Language Models


10 Jul 2024 | Rosie Zhao*, Depen Morwani*, David Brandfonbrener*, Nikhil Vyas*, Sham Kakade
This paper investigates the effectiveness of several optimization algorithms for autoregressive language modeling, including SGD, Adafactor, Adam, and Lion, comparing them across model sizes, hyperparameter settings, and architectural variants. With the exception of SGD, these algorithms achieve comparable optimal performance and remain stable across a wide range of hyperparameter choices, so no single optimizer emerges as a clear winner; in practice, the choice can be guided by considerations such as memory constraints and ease of implementation.

The paper then examines simplified variants of Adam. Signum, which updates parameters using only the sign of the momentum, recovers Adam's performance and hyperparameter stability. Adalayer, a layerwise variant of Adam, shows that the largest impact of Adam's preconditioning is confined to the last layer and the LayerNorm parameters: adaptivity is necessary only for these parameters, while the remaining layers can be trained with SGD without loss of performance or stability. Taken together, the results suggest that the choice of optimizer is less critical than previously thought, and that practical considerations should guide its selection.
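
To make the last finding concrete, the following is a minimal PyTorch sketch of the hybrid setup the results suggest: adaptive updates (Adam) only for the last layer and the LayerNorm parameters, plain SGD with momentum for everything else. This is not the authors' implementation; the module name "lm_head", the learning rates, and the momentum value are illustrative assumptions.

# A minimal sketch (not the authors' code) of the paper's observation:
# give the last layer and LayerNorm parameters to Adam and train
# everything else with SGD. The module name "lm_head" and the learning
# rates below are assumptions chosen only for illustration.
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module, lm_head_name: str = "lm_head",
                     sgd_lr: float = 0.1, adam_lr: float = 1e-3):
    adaptive_params = []
    for name, module in model.named_modules():
        # LayerNorm parameters and the final (output) layer get adaptivity.
        if isinstance(module, nn.LayerNorm) or name == lm_head_name:
            adaptive_params.extend(module.parameters(recurse=False))
    adaptive_ids = {id(p) for p in adaptive_params}
    # Everything else (attention, MLP, embedding weights) is trained with SGD.
    # Note: weight tying between embedding and output layer is not handled here.
    sgd_params = [p for p in model.parameters() if id(p) not in adaptive_ids]
    adam = torch.optim.Adam(adaptive_params, lr=adam_lr)
    sgd = torch.optim.SGD(sgd_params, lr=sgd_lr, momentum=0.9)
    return adam, sgd

# Usage: step both optimizers on each training iteration.
# adam, sgd = build_optimizers(model)
# loss.backward(); adam.step(); sgd.step(); adam.zero_grad(); sgd.zero_grad()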