Towards understanding convergence and generalization of AdamW


2024 | Pan ZHOU, Xingyu XIE, Zhouchen LIN, Shuicheng YAN
This paper investigates the convergence and generalization properties of AdamW, an adaptive gradient optimization algorithm that incorporates decoupled weight decay. AdamW modifies the Adam optimizer by decoupling weight decay from the optimization steps, which differs from the traditional $ \ell_{2} $-regularizer that affects the gradient moments. The authors prove that AdamW converges and minimizes a dynamically regularized loss that combines the original loss with a dynamic regularization term induced by the decoupled weight decay. This leads to different behavior compared to Adam and $ \ell_{2} $-regularized Adam ($ \ell_{2} $-Adam). On both general nonconvex problems and PL-conditioned problems, the authors establish the stochastic gradient complexity of AdamW to find a stationary point. This complexity is applicable to Adam and $ \ell_{2} $-Adam and improves their previously known complexity, especially for over-parameterized networks. Additionally, the authors show that AdamW enjoys smaller generalization errors than Adam and $ \ell_{2} $-Adam from the Bayesian posterior perspective. This result explicitly reveals the benefits of decoupled weight decay in AdamW. The paper also compares AdamW with $ \ell_{2} $-Adam, showing that AdamW often achieves better generalization performance. Theoretical analysis and experimental results validate these findings. The experiments demonstrate that AdamW performs well on ImageNet, with significant improvements in generalization compared to $ \ell_{2} $-Adam. The results suggest that the decoupled weight decay in AdamW contributes to better convergence and generalization properties.
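The distinction the paper rests on can be sketched in a few lines of code: $ \ell_{2} $-Adam folds the decay term $ \lambda\theta $ into the gradient, so it is rescaled by the adaptive moment estimates, while AdamW subtracts the decay directly from the parameter after the adaptive step. The sketch below is a minimal single-parameter illustration under assumed hyperparameters (quadratic loss, `lr`, `wd`, step count are all illustrative), not the authors' implementation or analysis setup.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One Adam update on a scalar parameter.

    decoupled=False with wd > 0 gives l2-regularized Adam: the decay
    term enters the gradient, and hence the moment estimates.
    decoupled=True gives AdamW: the decay is applied directly to
    theta and never touches the moments.
    """
    if not decoupled:
        grad = grad + wd * theta           # l2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta    # decoupled weight decay (AdamW)
    return theta, m, v

def run(decoupled, wd=0.1):
    """Minimize f(theta) = (theta - 3)^2, with weight decay pulling toward 0."""
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, 501):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t, wd=wd, decoupled=decoupled)
    return theta

print(run(decoupled=False))  # l2-Adam
print(run(decoupled=True))   # AdamW
```

With `wd=0` the two variants coincide exactly; with `wd > 0` they trace different trajectories, which is the starting point for the paper's observation that AdamW minimizes a dynamically regularized loss rather than the fixed $ \ell_{2} $-penalized one.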