Deconstructing What Makes a Good Optimizer for Language Models


10 Jul 2024 | Rosie Zhao*, Depen Morwani*, David Brandfonbrener*, Nikhil Vyas*, Sham Kakade
This paper investigates the effectiveness of several optimization algorithms for autoregressive language modeling, including SGD, Adafactor, Adam, and Lion, comparing them across model sizes, hyperparameter settings, and architectural variants. With the exception of SGD, these algorithms achieve comparable optimal performance and remain stable across a wide range of hyperparameter choices, so no single optimizer emerges as a clear winner; in practice, the choice can be guided by considerations such as memory constraints and ease of implementation.

The paper then examines simplified variants of Adam. Signum, which updates parameters using only the sign of the momentum, recovers Adam's performance and hyperparameter stability. Adalayer, a layerwise variant of Adam, shows that the largest impact of Adam's preconditioning is confined to the last layer and the LayerNorm parameters: adaptivity is necessary only for these parameters, while the remaining layers can be trained with SGD without loss of performance or stability. Taken together, the results suggest that the choice of optimizer is less critical than previously thought, and that practical considerations should guide its selection.
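
To make the last finding concrete, the following is a minimal PyTorch sketch of the hybrid setup the results suggest: adaptive updates (Adam) only for the last layer and the LayerNorm parameters, plain SGD with momentum for everything else. This is not the authors' implementation; the module name "lm_head", the learning rates, and the momentum value are illustrative assumptions.

# A minimal sketch (not the authors' code) of the paper's observation:
# give the last layer and LayerNorm parameters to Adam and train
# everything else with SGD. The module name "lm_head" and the learning
# rates below are assumptions chosen only for illustration.
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module, lm_head_name: str = "lm_head",
                     sgd_lr: float = 0.1, adam_lr: float = 1e-3):
    adaptive_params = []
    for name, module in model.named_modules():
        # LayerNorm parameters and the final (output) layer get adaptivity.
        if isinstance(module, nn.LayerNorm) or name == lm_head_name:
            adaptive_params.extend(module.parameters(recurse=False))
    adaptive_ids = {id(p) for p in adaptive_params}
    # Everything else (attention, MLP, embedding weights) is trained with SGD.
    # Note: weight tying between embedding and output layer is not handled here.
    sgd_params = [p for p in model.parameters() if id(p) not in adaptive_ids]
    adam = torch.optim.Adam(adaptive_params, lr=adam_lr)
    sgd = torch.optim.SGD(sgd_params, lr=sgd_lr, momentum=0.9)
    return adam, sgd

# Usage: step both optimizers on each training iteration.
# adam, sgd = build_optimizers(model)
# loss.backward(); adam.step(); sgd.step(); adam.zero_grad(); sgd.zero_grad()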