This paper investigates the difference between $L_2$ regularization and weight decay in adaptive gradient methods such as Adam. While the two are equivalent for standard stochastic gradient descent (SGD), they are not equivalent for adaptive gradient methods. The authors propose decoupling weight decay from the gradient-based optimization step, which improves Adam's generalization performance and allows it to compete with SGD with momentum on image classification tasks, where Adam was previously outperformed. The resulting method, called AdamW, has been widely adopted by researchers and is implemented in popular deep learning frameworks such as TensorFlow and PyTorch.

The authors show that decoupled weight decay leads to a more separable hyperparameter space, making tuning easier: the optimal weight decay factor becomes largely independent of the learning rate. They also demonstrate that it significantly improves generalization, yielding about a 15% relative improvement in test error on image classification datasets, and that it is more effective than $L_2$ regularization for both SGD and Adam. Decoupled weight decay can further be combined with learning rate schedules such as cosine annealing to improve performance. The paper concludes that decoupled weight decay is a valuable improvement for adaptive gradient methods and helps make Adam competitive with SGD with momentum.
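To make the distinction concrete, below is a minimal NumPy sketch of a single Adam parameter update in which the decay term is either folded into the gradient ($L_2$ regularization) or applied directly to the weights after the adaptive step (decoupled weight decay, in the style of AdamW). The function name, hyperparameter defaults, and the `decoupled` flag are illustrative choices for this sketch, not the authors' reference implementation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One parameter update contrasting L2 regularization with
    decoupled weight decay (illustrative sketch, not reference code)."""
    if not decoupled:
        # L2 regularization: the decay term is added to the gradient,
        # so it is rescaled by the adaptive denominator below.
        grad = grad + weight_decay * w

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Adaptive gradient step.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled weight decay: applied directly to the weights,
        # outside the adaptive update.
        w = w - lr * weight_decay * w

    return w, m, v
```

In practice one would use a framework implementation such as `torch.optim.AdamW` rather than hand-rolling the update; the point of the sketch is that the decoupled decay term bypasses the adaptive rescaling by $\sqrt{\hat{v}_t}$, which is why weight decay and $L_2$ regularization behave differently under Adam.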