Decoupled Weight Decay Regularization

4 Jan 2019 | Ilya Loshchilov & Frank Hutter
The paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter addresses the fact that L₂ regularization and weight decay are not equivalent in adaptive gradient algorithms such as Adam. The authors propose a modification, which they call "decoupled weight decay," that separates the weight decay from the optimization steps taken with respect to the loss function. They provide empirical evidence that this modification separates the optimal choice of weight decay factor from the learning rate setting and significantly improves Adam's generalization performance, making it competitive with SGD with momentum on image classification datasets. The method has been adopted by many researchers and implemented in TensorFlow and PyTorch. The paper also discusses a theoretical justification of decoupled weight decay within a Bayesian filtering framework.
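To make the distinction concrete, here is a minimal sketch of a single Adam step on one scalar weight, with the regularizer applied either as an L₂ penalty folded into the gradient or as a decoupled decay applied directly to the weight. The function name `adam_step`, the `state` dictionary, and the choice to scale the decay term by the learning rate are assumptions made for illustration; they are not taken from the paper's text or from any library's API.

```python
import math


def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update on a scalar weight (illustrative sketch only).

    decoupled=False: L2 regularization, the penalty gradient is added to
                     the loss gradient before the adaptive rescaling.
    decoupled=True:  the weight is shrunk directly, outside the adaptive step.
    """
    if not decoupled:
        # L2 penalty: its gradient passes through Adam's moment estimates,
        # so it is rescaled by the adaptive denominator like the loss gradient.
        grad = grad + weight_decay * w

    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment

    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled decay: applied to the weight itself, not via the gradient.
        # Scaling by lr is an assumption of this sketch.
        w_new -= lr * weight_decay * w
    return w_new


def fresh_state():
    """Hypothetical helper: zero-initialized optimizer state."""
    return {"t": 0, "m": 0.0, "v": 0.0}


# With a zero loss gradient, only the regularizer moves the weight.
w_decoupled = adam_step(1.0, 0.0, fresh_state(), lr=0.1,
                        weight_decay=0.1, decoupled=True)
w_l2 = adam_step(1.0, 0.0, fresh_state(), lr=0.1,
                 weight_decay=0.1, decoupled=False)
w_l2_small = adam_step(1.0, 0.0, fresh_state(), lr=0.1,
                       weight_decay=0.01, decoupled=False)
print(w_decoupled, w_l2, w_l2_small)
```

In this toy setting the decoupled step shrinks the weight by exactly `lr * weight_decay * w`, whereas the L₂ variant's first step is about `lr` in size for either value of `weight_decay`, because Adam's normalization cancels the magnitude of the penalty gradient. This is one way to see why the two forms of regularization behave differently under adaptive methods.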